Background & Summary

The soybean hawkmoth Clanis bilineata tsingtauica Mell (Lepidoptera, Sphingidae, Clanis; CBT), an agricultural pest of soybean, is mainly distributed in China, Japan, and the Korean Peninsula1. CBT larvae pass through five instars, and the fifth instar is the most voracious feeding stage2. In severe infestations, the larvae can strip plants down to the stems, causing crop failure or even plant death3 (Fig. 1).

Fig. 1 Developmental stages of CBT and its damage to soybeans. (a) Egg. (b) Fifth instar larva. (c) Pupa. (d) Adult. (e) Damaged soybean leaves with early-instar larvae on them. (f) Soybean plants damaged by fifth instar larvae in a net room. (g) Fifth instar larvae harvested from artificial rearing.

Meanwhile, CBT has a long history of consumption as an important edible insect in China4. The meat of fifth instar larvae is eaten freeze-dried, fried, in soup, or canned5. CBT larvae are nutrient-rich and contain abundant essential amino acids, making them a high-quality protein source6. At present, CBT is mainly obtained through artificial rearing7. The artificial rearing of CBT has become a promising agricultural industry in China, with an annual production of 30,000 tons and an output value of nearly 620 million dollars8.

Sphingidae comprises about 1,500 species worldwide9, many of which are considered significant agricultural pests, such as the tobacco hornworm (Manduca sexta) and the sweet potato hornworm (Agrius convolvuli). However, hawkmoth genomes have been poorly studied. To date, genome assemblies can be retrieved for only 14 species of Sphingidae (NCBI, as of January 2024), including Hyles lineata (Macroglossinae)10, Hyles euphorbiae (Macroglossinae)11, Mimas tiliae (Smerinthinae)12, Deilephila porcellus (Macroglossinae)13, and M. sexta (Sphinginae)14.

In the present study, we assembled a chromosome-level genome of CBT for the first time using PacBio HiFi reads and Hi-C sequencing technologies. We annotated the repeat elements, non-coding RNAs (ncRNAs), and protein-coding genes of this genome. Additionally, we performed chromosomal synteny analysis of the CBT genome against those of Bombyx mori and M. sexta. The high-quality CBT genome will support further study of the use of CBT as an edible insect, of its damage mechanisms, and of integrated pest management strategies for sphingid species.

Methods

Sample collection and sequencing

Fifth instar CBT larvae were collected from a soybean field; the population originated from Lianyungang, Jiangsu Province, China. The larvae were then kept in an incubator at 26 ± 1 °C, 60% ± 10% relative humidity, and a photoperiod of 14 h L: 10 h D. After two days of starvation, the larvae were washed with distilled water and placed in liquid nitrogen.

Genomic DNA from CBT was extracted using the CTAB method. A short-read library with an insert size of 200‒400 bp was constructed using the Agencourt AMPure XP-Medium kit according to the manufacturer’s instructions and sequenced on the DNBSEQ-T7 platform. A PacBio HiFi library with an insert size of 15 kb was constructed using the SMRTbell® Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel IIe platform. For Hi-C sequencing, the extracted DNA was digested with the MboI restriction enzyme, and the library was sequenced on the Illumina Xplus platform. A next-generation RNA-seq library was constructed using the VAHTS mRNA-seq v2 Library Prep Kit and was also sequenced on the Illumina Xplus platform. A third-generation full-length RNA sequencing library for Oxford Nanopore Technologies (ONT) was constructed by BenaGen (Wuhan, China) using the SQK-PCS109 + SQK-PBK004 kits and sequenced on the Oxford Nanopore PromethION platform. All library construction and sequencing were completed by Berry Genomic (Beijing, China), except for the construction and sequencing of the ONT RNA library. In total, we obtained 30.59 Gb (64.07×) of whole-genome sequencing (WGS) raw data, 36.70 Gb (76.87×) of HiFi data, 74.68 Gb (156.42×) of Hi-C data, 11.81 Gb of RNA-seq data, and 13.71 Gb of RNA-ONT data (Table 1), all of high quality (Tables S1–S5).

Table 1 Statistics of sequencing data of C. bilineata tsingtauica.
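
The fold-coverage values quoted with the data yields follow from dividing each yield by the genome size. The sketch below assumes the 477.45 Mb assembly size reported in the Genome assembly section as the denominator; the function name `depth` is ours and is used only for illustration.

```python
# Sequencing depth = total bases sequenced / genome size.
GENOME_SIZE_MB = 477.45  # assembly size reported in the Genome assembly section

# Data yields in Gb, as stated in the text.
yields_gb = {"WGS": 30.59, "HiFi": 36.70, "Hi-C": 74.68}


def depth(yield_gb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Return fold coverage for a data yield given in Gb."""
    return yield_gb * 1000 / genome_mb


for library, gb in yields_gb.items():
    print(f"{library}: {depth(gb):.2f}x")
```

Because the yields are themselves rounded to two decimals, the recomputed depths can differ from the reported ones in the last digit.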

Genome assembly

We used pbccs v6.4.0 (https://github.com/PacificBiosciences/ccs) to remove low-quality HiFi reads with base quality below Q20. We then ran Hifiasm v0.19.615 with default parameters for the initial assembly and retained only contigs with a coverage depth exceeding 6×. Subsequently, Hi-C data and the YAHS v1.216 pipeline were used to anchor contigs onto chromosomes. Hi-C data were quality-controlled and aligned to the genome using chromap v0.2.517. Two rounds of scaffolding were performed using YAHS v1.2 with default parameters: the result of the first round was manually corrected using Juicebox v1.11.0818, and the second round of scaffolding was then performed. The sequencing coverage of each pseudochromosome was evaluated with SAMtools v1.108 (https://www.htslib.org). The Hi-C interaction heatmap indicates high-quality scaffolding (Fig. 2). We used MMseqs2 v1319 to perform blastn-like searches against the NCBI nt and UniVec databases to detect potential contaminants in the assembly. Minimap2 (https://github.com/lh3/minimap2) was used to align reads back to the genome assembly. Compleasm v0.2.420 with the insecta_odb10 dataset (n = 1,367 orthologues) and merqury v1.321 were used to assess Benchmarking Universal Single-Copy Orthologues (BUSCO) completeness and the single-base quality value (QV), respectively.
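
The depth filter applied to the initial contigs can be pictured with a minimal sketch. The text does not say how per-contig depth was computed, so the mapping of contig name to mean depth is assumed to be available already; the contig names and depth values are hypothetical.

```python
MIN_DEPTH = 6.0  # threshold stated in the text: coverage depth exceeding 6x


def filter_contigs(mean_depth: dict[str, float],
                   min_depth: float = MIN_DEPTH) -> list[str]:
    """Keep contigs whose mean read depth is strictly greater than min_depth."""
    return [name for name, d in mean_depth.items() if d > min_depth]


# Hypothetical example values, not data from the study.
example_depths = {"ctg1": 72.4, "ctg2": 5.1, "ctg3": 6.0, "ctg4": 68.9}
kept = filter_contigs(example_depths)
print(kept)
```

The comparison is strict, so a contig at exactly 6× is dropped, matching the wording "exceeding 6×".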

Fig. 2 Hi-C interaction heatmap of C. bilineata tsingtauica.

Finally, we obtained a high-quality chromosome-level genome of CBT, with a genome size of 477.45 Mb and a GC content of 38.55% (Table 2). The assembly comprised 66 contigs and 56 scaffolds, with scaffold N50 and contig N50 both at 17.43 Mb. A total of 475.61 Mb of contigs were anchored to 29 pseudochromosomes, an anchoring rate of 99.61%. BUSCO completeness of the genome was 99.49% (C), with only 0.15% duplicated (D), 0.15% fragmented (F), and 0.37% missing (M) BUSCOs. The mapping rates for WGS, HiFi, RNA-seq, and RNA-ONT data were 97.86%, 99.90%, 95.78%, and 90.39%, respectively (Table 2). Chromosome 29 was the shortest, with a length of 8,386,962 bp, while chromosome 22 was the longest at 25,470,929 bp. The average chromosome length was 16,400,209 bp. In terms of base-level quality, the mean QV across all chromosomes was approximately 58, and the average sequencing coverage depth was about 70× for HiFi and 61× for WGS (Table 3). These indicators suggest that the CBT genome assembly is of very high quality in terms of completeness and continuity. In addition, we found a complete mitochondrial genome sequence of 15,417 bp in the assembly, which was annotated with MitoZ v3.622 (Fig. S1).

Table 2 Chromosome-level genome assembly statistics of C. bilineata tsingtauica.
Table 3 Length, sequencing coverage, and QV of each chromosome in the genome assembly.
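
The summary metrics reported for the assembly can be recomputed from their definitions. The sketch below shows an N50 function on a toy list of lengths (hypothetical, not the CBT contigs), the anchoring rate from the two reported sizes, and the per-base error rate implied by a Phred-scaled QV. The function names are ours.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def anchoring_rate(anchored_mb, total_mb):
    """Percentage of the assembly placed on pseudochromosomes."""
    return 100 * anchored_mb / total_mb


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


toy_lengths = [10, 8, 6, 4, 2]              # hypothetical sequence lengths
print(n50(toy_lengths))                      # 10 + 8 = 18 covers half of 30
print(round(anchoring_rate(475.61, 477.45), 2))
print(qv_to_error_rate(58))                  # roughly 1.6 errors per million bases
print(round(29 * 16_400_209 / 1e6, 2))       # 29 chromosomes x mean length, in Mb
```

The last line is a consistency check: 29 pseudochromosomes at the reported mean length sum to the 475.61 Mb of anchored sequence.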

Genome annotation

We employed RepeatModeler v2.0.523 with the “LTRStruct” LTR discovery pipeline to construct a repeat library. This library was combined with the Dfam 3.724 and RepBase-2018102625 databases to form a custom library. Repeat elements were identified by searching the genome against the custom library using RepeatMasker v4.1.526. The analysis revealed 252.16 Mb of repeat elements, accounting for 52.81% of the genome. The major repeat classes were LINEs (14.73%), SINEs (14.56%), unclassified elements (12.43%), LTRs (3.60%), rolling-circles (3.42%), and DNA elements (3.06%) (Table 4; Table S6). Subsequently, non-coding RNAs were identified with Infernal v1.1.527 against the Rfam database, and tRNAs were predicted using tRNAscan-SE v2.0.1228. Low-confidence tRNAs were filtered out using the built-in ‘EukHighConfidenceFilter’ script. In total, we annotated 1,434 ncRNAs, including 170 rRNAs, 74 miRNAs, 76 snRNAs, and 636 tRNAs (Table 4; Table S7). Genome features were visualized with TBTools-II v2.04229 in combination with the annotation (Fig. 3).

Table 4 Genome annotation statistics of C. bilineata tsingtauica.
Fig. 3 Genome characteristics of C. bilineata tsingtauica (window size 100 kb). From the outer ring to the inner ring: chromosome length, GC content, gene density, and TE distribution (DNA, SINE, LINE, and LTR).
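
The repeat and ncRNA totals in the annotation paragraph can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in the text; the "other" remainders are derived here and are not values reported by the authors.

```python
GENOME_MB = 477.45   # assembly size
REPEAT_MB = 252.16   # total repeat content

# Percent of the genome per major repeat class, as listed in the text.
repeat_classes_pct = {
    "LINE": 14.73, "SINE": 14.56, "Unclassified": 12.43,
    "LTR": 3.60, "Rolling-circle": 3.42, "DNA": 3.06,
}

total_repeat_pct = 100 * REPEAT_MB / GENOME_MB
listed_pct = sum(repeat_classes_pct.values())
other_repeat_pct = total_repeat_pct - listed_pct   # classes not itemised in the text

# ncRNA counts named in the text, out of 1,434 annotated in total.
ncrna = {"rRNA": 170, "miRNA": 74, "snRNA": 76, "tRNA": 636}
other_ncrna = 1434 - sum(ncrna.values())

print(f"repeats: {total_repeat_pct:.2f}% total, {listed_pct:.2f}% in listed classes")
print(f"ncRNAs outside the four named types: {other_ncrna}")
```

The six listed classes account for most but not all of the repeat fraction, and the four named ncRNA types for about two thirds of the ncRNAs; the remainder belongs to categories detailed in the supplementary tables.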

Protein-coding genes were annotated using MAKER v3.01.0430 by integrating three strategies: ab initio, transcriptome-based, and homology-based prediction. BRAKER v3.0.631 and GeMoMa v1.932 were used to integrate transcriptome and protein evidence, and their predictions were combined as the ab initio input file for MAKER. Transcriptome alignment BAM files were generated using HISAT2 v2.2.133. BRAKER automatically trained Augustus v3.4.034 and GeneMark-ETP35, combining transcriptome data with arthropod homologous protein sequences from the OrthoDB11 database36 to improve prediction accuracy. Additionally, homology-based prediction was performed using GeMoMa based on the gene annotations of Drosophila melanogaster (Diptera), M. sexta (Lepidoptera), Amyelois transitella (Lepidoptera), B. mori (Lepidoptera), and Spodoptera frugiperda (Lepidoptera) from GenBank (Table 5). For the transcriptome-based prediction approach, the transcriptome was assembled using StringTie v2.2.137 from the BAM files generated with HISAT2.

Table 5 Genome datasets used for homology-based gene prediction in this study.

In total, we predicted 14,214 protein-coding genes in the CBT genome using MAKER, with an average gene length of 16,966.9 bp. The average numbers of exons, introns, and CDSs per gene were 7.7, 6.7, and 7.4, respectively (Table 4). The average lengths of exons, introns, and CDSs were 314.6 bp, 2,347.9 bp, and 222.5 bp, respectively (Table 4). Moreover, BUSCO completeness of the predicted protein-coding gene sequences was 98.90%, including 77.47% single-copy, 21.43% duplicated, 0.07% fragmented, and 1.02% missing BUSCOs.
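
Gene-structure statistics of this kind are typically derived from the exon coordinates in a GFF3 annotation. The following is a minimal sketch under that assumption; it is not the authors' script, the records are hypothetical, and it summarizes per transcript rather than per gene. It also makes explicit why the reported intron count (6.7) is exactly one less than the exon count (7.7).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical GFF3 records (tab-separated); not taken from the CBT annotation.
TOY_GFF3 = """\
chr1\tMAKER\tmRNA\t100\t1000\t.\t+\t.\tID=tx1;Parent=gene1
chr1\tMAKER\texon\t100\t300\t.\t+\t.\tParent=tx1
chr1\tMAKER\texon\t500\t600\t.\t+\t.\tParent=tx1
chr1\tMAKER\texon\t900\t1000\t.\t+\t.\tParent=tx1
chr1\tMAKER\tmRNA\t2000\t2500\t.\t-\t.\tID=tx2;Parent=gene2
chr1\tMAKER\texon\t2000\t2200\t.\t-\t.\tParent=tx2
chr1\tMAKER\texon\t2400\t2500\t.\t-\t.\tParent=tx2
"""


def exon_intron_stats(gff3_text):
    """Summarise exon counts and exon/intron lengths per transcript."""
    exons = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is not None:
            exons[parent].append((int(cols[3]), int(cols[4])))

    exon_counts, exon_lengths, intron_lengths = [], [], []
    for spans in exons.values():
        spans.sort()
        exon_counts.append(len(spans))
        # GFF3 coordinates are 1-based and inclusive.
        exon_lengths += [end - start + 1 for start, end in spans]
        # An intron is the gap between two consecutive exons.
        intron_lengths += [nxt[0] - cur[1] - 1 for cur, nxt in zip(spans, spans[1:])]

    return {
        "mean_exons": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
    }


stats = exon_intron_stats(TOY_GFF3)
print(stats)
```

Each transcript with n exons contributes n − 1 introns, so the mean intron count is always the mean exon count minus one.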

Functional annotation of the genes was performed using Diamond v2.1.7.16138 (--very-sensitive -e 1e-5) by searching against the UniProtKB v202305 database. For further gene functional annotation, InterPro 5.65–97.039 was used to search databases including Pfam40, SMART41, Superfamily42, and CDD43. The eggNOG v5.0.244 database (http://eggnog6.embl.de) was searched with eggNOG-mapper v2.1.1245. After integrating these results, we found that 13,889 (97.71%) genes were functionally annotated against the UniProtKB database. InterPro identified structural domains for 11,694 protein-coding genes. InterPro and eggNOG-mapper jointly annotated GO terms for 10,190 genes and KEGG pathways for 4,863 genes (Table 4).
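
The step from search hits to a count of annotated genes can be illustrated with a short sketch. It assumes the hits are in the standard 12-column BLAST tabular layout (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the hit rows, gene names, and subject IDs are hypothetical, and choosing the best hit by bit score is our assumption rather than a documented step of the study.

```python
E_VALUE_CUTOFF = 1e-5  # same cutoff as quoted in the text

# Hypothetical hits in 12-column tabular format; not results from the study.
TOY_HITS = """\
gene1\tsp|P1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300
gene1\tsp|P2\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150
gene2\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25
gene3\tsp|P4\t70.0\t150\t45\t0\t1\t150\t1\t150\t2e-30\t210
"""


def best_hits(tabular, max_evalue=E_VALUE_CUTOFF):
    """Return {query: subject} keeping the highest-bit-score hit per query
    among hits whose e-value passes the cutoff."""
    best = {}
    for line in tabular.splitlines():
        cols = line.split("\t")
        if len(cols) < 12:
            continue
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


hits = best_hits(TOY_HITS)
print(hits)

# Annotation rate reported in the text: 13,889 of 14,214 genes.
print(round(100 * 13_889 / 14_214, 2))
```

In the toy data, gene2 has only a weak hit (e-value 0.5) and therefore remains unannotated, which is how a gene falls outside the annotated fraction.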

Chromosomal synteny analysis

To explore interspecific chromosomal relationships, we performed chromosomal synteny analysis of CBT against B. mori (Lepidoptera) and M. sexta (Lepidoptera) (Table 5). Protein sequences were aligned using Diamond with the parameters “--ultra-sensitive --iterate -e 1e-5 -k 5”. Chromosomal synteny was then analyzed using MCScanX46 with the parameters “-s 5 -e 1e-5”. The results revealed strong synteny between the chromosomes of CBT and M. sexta (Fig. 4). A chromosomal fission or fusion event occurred between M. sexta Chr28 and CBT Chr15 + Chr29. Synteny between CBT and B. mori was also strong but slightly weaker than that between CBT and M. sexta, with more frequent chromosomal fusion or fission events. Moreover, the autosomes and the Z sex chromosome were distinguished by chromosomal synteny, based on the relatively conserved nature of the Z chromosome in Lepidoptera47. Chromosome 1 was thereby identified as the Z chromosome, as it shares high synteny with the Z chromosomes of B. mori and M. sexta (Fig. 4).

Fig. 4 Chromosomal synteny among C. bilineata tsingtauica (Cbil), Bombyx mori (Bmor) and Manduca sexta (Msex). Colored stripes mark the major chromosomal fissions or fusions.
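
One way to spot candidate fission or fusion events from synteny results is to tally, for each reference chromosome, which query chromosomes carry its syntenic gene pairs. The sketch below is an illustration of that idea only, not the authors' procedure: the anchor counts are hypothetical and merely mimic the Chr28 versus Chr15 + Chr29 pattern described in the text, and the 20% threshold is an arbitrary choice.

```python
from collections import Counter, defaultdict


def chromosome_partners(anchor_pairs, min_fraction=0.2):
    """For each reference chromosome, list the query chromosomes that hold
    at least min_fraction of its syntenic anchor pairs."""
    counts = defaultdict(Counter)
    for ref_chr, query_chr in anchor_pairs:
        counts[ref_chr][query_chr] += 1

    partners = {}
    for ref_chr, counter in counts.items():
        total = sum(counter.values())
        partners[ref_chr] = sorted(
            q for q, n in counter.items() if n / total >= min_fraction
        )
    return partners


# Hypothetical anchor pairs (reference chromosome, query chromosome).
toy_pairs = (
    [("Msex_Chr28", "Cbil_Chr15")] * 60
    + [("Msex_Chr28", "Cbil_Chr29")] * 35
    + [("Msex_Chr28", "Cbil_Chr3")] * 5      # scattered hits below the threshold
    + [("Msex_Chr1", "Cbil_Chr1")] * 100
)

result = chromosome_partners(toy_pairs)
# Reference chromosomes matching more than one query chromosome are
# candidates for a fission or fusion event.
split = {ref: qs for ref, qs in result.items() if len(qs) > 1}
print(split)
```

Synteny alone does not tell whether one chromosome split or two fused; resolving the direction requires an outgroup, which is why the text speaks of "fission or fusion".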

Data Records

The Hi-C, PacBio HiFi, ONT RNA-seq, RNA-seq, and WGS data for the CBT genome are available on NCBI under accession numbers SRR27748981‒SRR2774898548 and BioProject accession number PRJNA106022249. The assembled genome has been deposited in NCBI Assembly under accession number GCA_036417725.150. Additionally, the annotation results of the CBT genome have been deposited in Figshare51.

Technical Validation

Three methods were used to assess the quality of the CBT genome assembly. First, the purity of the genomic DNA was verified using a NanoDrop 2000 spectrophotometer and Qubit fluorometric quantitation, and its integrity was checked via pulsed-field gel electrophoresis and agarose gel electrophoresis. The 260/280 nm absorbance ratio was approximately 1.89. Second, we used compleasm v0.2.4 with the insecta_odb10 database (n = 1,367 orthologues) as a reference to assess the completeness of the genome assembly. BUSCO completeness was 99.49%, including 99.34% single-copy and 0.15% duplicated BUSCOs, with 0.15% fragmented and 0.37% missing BUSCOs (Table 2). The predicted protein-coding gene sequences were also evaluated for BUSCO completeness, giving C: 98.90% [S: 77.47%, D: 21.43%], F: 0.07%, M: 1.02%. Third, reads were aligned back to the assembly using Minimap2, and the mapping rates for WGS, RNA-seq, RNA-ONT, and HiFi data were all over 90% (Table 2).
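
The BUSCO percentages can be related back to orthologue counts out of the 1,367 in insecta_odb10. The text reports percentages only, so the counts in the sketch below are our inference: they are the integers that reproduce every reported percentage after rounding to two decimals, and they should be read as an assumption rather than as published values.

```python
N_BUSCOS = 1367  # size of insecta_odb10, as stated in the text


def busco_summary(single, duplicated, fragmented, missing, n=N_BUSCOS):
    """Convert BUSCO category counts into rounded percentages."""
    if single + duplicated + fragmented + missing != n:
        raise ValueError("category counts must sum to the dataset size")

    def pct(count):
        return round(100 * count / n, 2)

    return {
        "C": pct(single + duplicated),  # complete = single-copy + duplicated
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
    }


# Counts inferred from the reported percentages (assumption, see lead-in).
genome = busco_summary(single=1358, duplicated=2, fragmented=2, missing=5)
proteins = busco_summary(single=1059, duplicated=293, fragmented=1, missing=14)

print("genome  :", genome)
print("proteins:", proteins)
```

Because each category is rounded separately, the reported genome percentages (99.49 + 0.15 + 0.37) sum to 100.01 rather than exactly 100.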