Abstract
The soybean hawkmoth Clanis bilineata tsingtauica Mell (Lepidoptera, Sphingidae; CBT), one of the main leaf-chewing pests of soybean, has recently gained popularity as an edible insect in China because of its high nutritional value. However, a high-quality genome of CBT has not been available, which greatly limits further research. In the present study, we assembled the first high-quality chromosome-level genome of CBT using PacBio HiFi reads and Hi-C technology. The assembled genome is 477.45 Mb in size, with a contig N50 of 17.43 Mb. After Hi-C scaffolding, the contigs were anchored to 29 chromosomes with an anchoring rate of 99.61%. The Benchmarking Universal Single-Copy Orthologues (BUSCO) completeness is 99.49%. The genome contains 252.16 Mb of repeat elements and 14,214 protein-coding genes. In addition, chromosomal synteny analysis showed that the CBT genome has strong synteny with that of Manduca sexta. In conclusion, this high-quality genome provides an important resource for future studies of CBT and contributes to the development of integrated pest management strategies.
Background & Summary
The soybean hawkmoth Clanis bilineata tsingtauica Mell (Lepidoptera, Sphingidae, Clanis; CBT), an agricultural pest of soybean, is mainly distributed in China, Japan, and the Korean Peninsula1. CBT larvae pass through five instars, and the fifth instar is the most voracious feeding stage2. In severe infestations, the larvae strip plants down to the stems, causing crop failure or even plant death3 (Fig. 1).
Meanwhile, CBT has a long history of consumption as an important edible insect in China4. The meat of fifth-instar larvae is eaten freeze-dried, fried, in soup, or canned5. The larvae of CBT are nutrient-rich and contain abundant essential amino acids, so they can serve as a high-quality protein source6. At present, CBT is mainly obtained through artificial rearing7. The artificial rearing of CBT has become a promising agricultural industry in China, with an annual production of 30,000 tons and an output value of nearly 620 million dollars8.
Sphingidae comprises about 1,500 species worldwide9, many of which are considered significant agricultural pests, such as the tobacco hornworm (Manduca sexta) and the sweet potato hornworm (Agrius convolvuli). However, hawkmoth genomes have been poorly studied. To date, genome assemblies can be retrieved for only 14 species of Sphingidae (NCBI, as of January 2024), including Hyles lineata (Macroglossinae)10, Hyles euphorbiae (Macroglossinae)11, Mimas tiliae (Smerinthinae)12, Deilephila porcellus (Macroglossinae)13, and M. sexta (Sphinginae)14.
In the present study, we assembled a chromosome-level genome of CBT for the first time using PacBio HiFi reads and Hi-C sequencing technologies. We annotated the repeat elements, non-coding RNAs (ncRNAs), and protein-coding genes of this genome. Additionally, we performed chromosomal synteny analysis of the CBT genome against those of Bombyx mori and M. sexta. This high-quality genome of CBT will support further study of its use as an edible insect, of its damage mechanisms, and of integrated pest management strategies for sphingid species.
Methods
Sample collection and sequencing
Fifth-instar CBT larvae were collected from a soybean field; the original population derived from Lianyungang, Jiangsu Province, China. The larvae were then kept in an incubator at a temperature of 26 ± 1 °C, a relative humidity of 60% ± 10%, and a photoperiod of 14 h L: 10 h D. After two days of starvation, the larvae were washed with distilled water and placed in liquid nitrogen.
Genomic DNA from CBT was extracted using the CTAB method. A short-read library with an insert size of 200‒400 bp was constructed using the Agencourt AMPure XP-Medium kit according to the manufacturer’s instructions and was sequenced on the DNBSEQ-T7 platform. A PacBio HiFi library with an insert size of 15 Kb was constructed using the SMRTbell® Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel IIe platform. For Hi-C sequencing, the extracted DNA was digested with the MboI restriction enzyme and the library was sequenced on the Illumina Xplus platform. A next-generation RNA-seq library was constructed using the VAHTS mRNA-seq v2 Library Prep Kit and was also sequenced on the Illumina Xplus platform. The third-generation full-length RNA sequencing library for Oxford Nanopore Technologies (ONT) was constructed using the SQK-PCS109 + SQK-PBK004 kits by BenaGen (Wuhan, China) and sequenced on the Oxford Nanopore PromethION platform. All library construction and sequencing was completed by Berry Genomic (Beijing, China), except for the construction and sequencing of the ONT RNA library. In total, we obtained 30.59 Gb (64.07×) of Whole-Genome Sequencing (WGS) raw data, 36.70 Gb (76.87×) of HiFi data, 74.68 Gb (156.42×) of Hi-C data, 11.81 Gb of RNA-seq data, and 13.71 Gb of RNA-ONT data (Table 1), all of high quality (Tables S1–S5).
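The fold-coverage values quoted above follow from dividing the data yield by the assembled genome size. The short Python sketch below illustrates that arithmetic; it assumes 1 Gb = 1,000 Mb and uses the rounded yields from the text, so the results agree with the reported depths only to within rounding. The function and variable names are illustrative and are not taken from the authors' commands.

```python
def sequencing_depth(data_gb: float, genome_mb: float) -> float:
    """Return fold coverage: total sequenced bases divided by genome size."""
    return (data_gb * 1000.0) / genome_mb  # assumption: 1 Gb = 1,000 Mb


GENOME_MB = 477.45  # assembled genome size reported in the text

# Rounded data yields (Gb) reported in the text.
datasets_gb = {"WGS": 30.59, "HiFi": 36.70, "Hi-C": 74.68}

depths = {
    name: round(sequencing_depth(gb, GENOME_MB), 2)
    for name, gb in datasets_gb.items()
}

for name, depth in depths.items():
    print(f"{name}: {depth}x")
```

Because the inputs are already rounded to two decimals, the last digit of a recomputed depth can differ from the published value.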
Genome assembly
We used pbccs v6.4.0 (https://github.com/PacificBiosciences/ccs) to remove low-quality HiFi reads with base quality below Q20. We then ran Hifiasm v0.19.615 with default parameters for the initial assembly and retained only contigs with a coverage depth exceeding 6×. Subsequently, Hi-C data and the YaHS v1.216 pipeline were used to anchor contigs onto chromosomes. Hi-C data were quality controlled and aligned to the genome using chromap v0.2.517. Two rounds of scaffolding were performed using YaHS v1.2 with default parameters: the result of the first round was manually corrected using Juicebox v1.11.0818 and then used as input for the second round. The sequencing coverage of each pseudochromosome was evaluated with SAMtools v1.108 (https://www.htslib.org). The Hi-C interaction heatmap indicates high-quality scaffolding (Fig. 2). We used MMseqs2 v1319 to perform blastn-like searches against the NCBI nt and UniVec databases to detect potential contaminants in the assembly. Minimap2 (https://github.com/lh3/minimap2) was used to align reads back to the genome assembly. Compleasm v0.2.420 with the insecta_odb10 dataset (n = 1,367 orthologues) was used to assess Benchmarking Universal Single-Copy Orthologues (BUSCO) completeness, and merqury v1.321 was used to estimate the single-base quality value (QV).
Finally, we obtained a high-quality chromosome-level genome of CBT, with a genome size of 477.45 Mb and a GC content of 38.55% (Table 2). The assembly included 66 contigs and 56 scaffolds, with both scaffold N50 and contig N50 lengths of 17.43 Mb. A total of 475.61 Mb of contigs were anchored to 29 pseudochromosomes, an anchoring rate of 99.61%. The BUSCO completeness of the genome was 99.49% (C), with only 0.15% duplicated BUSCOs (D), 0.15% fragmented BUSCOs (F), and 0.37% missing BUSCOs (M). The mapping rates for WGS, HiFi, RNA-seq, and RNA-ONT data were 97.86%, 99.90%, 95.78%, and 90.39%, respectively (Table 2). Chromosome 29 was the shortest, with a length of 8,386,962 bp, while chromosome 22 was the longest at 25,470,929 bp. The average chromosome length was 16,400,209 bp. In terms of sequencing quality, the mean QV across all chromosomes was approximately 58, while the average sequencing coverage depth was about 70× for HiFi and 61× for WGS (Table 3). These indicators suggest that the CBT genome assembly is of high quality in terms of completeness and continuity. In addition, we found a complete mitochondrial genome sequence in the assembly, with a length of 15,417 bp, which was annotated with MitoZ v3.622 (Fig. S1).
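The Q20 read filter and the QV reported by merqury are both expressed on the Phred scale, where a quality value Q corresponds to an error probability of 10^(-Q/10). The Python sketch below shows that conversion in both directions. It is a generic illustration of the Phred relationship, not the internal computation of pbccs or merqury, and the function names are hypothetical.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# Q20 corresponds to 1 expected error per 100 bases.
print(qv_to_error_rate(20))

# A QV near 58 corresponds to fewer than 2 expected errors per million bases.
print(qv_to_error_rate(58))
```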
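For readers less familiar with contiguity statistics, the Python sketch below shows how an N50 value and an anchoring rate of the kind reported above can be computed. N50 is taken here in its usual sense: the length of the shortest sequence among the longest sequences that together cover at least half of the total assembly. The toy lengths are invented for illustration; only the 475.61 Mb and 477.45 Mb figures come from the text.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def anchoring_rate(anchored_mb: float, genome_mb: float) -> float:
    """Return the percentage of the assembly anchored to chromosomes."""
    return anchored_mb / genome_mb * 100


toy_lengths = [10, 8, 6, 4, 2]  # invented lengths, for illustration only
toy_n50 = n50(toy_lengths)
rate = round(anchoring_rate(475.61, 477.45), 2)
print(toy_n50, rate)
```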
Genome annotation
We employed RepeatModeler v2.0.523 with the “LTRStruct” LTR discovery pipeline to construct a repeat library. This library was combined with the Dfam 3.724 and RepBase-2018102625 databases to form a custom library. Repeat elements were identified by aligning the genome against the custom library using RepeatMasker v4.1.526. The analysis revealed 252.16 Mb of repeat elements, accounting for 52.81% of the genome. The major repeat classes were LINEs (14.73%), SINEs (14.56%), Unclassified (12.43%), LTRs (3.60%), Rolling-circles (3.42%), and DNA elements (3.06%) (Table 4; Table S6). Subsequently, non-coding RNAs were identified with Infernal v1.1.527 based on the Rfam database, and tRNAs were predicted using tRNAscan-SE v2.0.1228. Low-confidence tRNAs were filtered out using the built-in ‘EukHighConfidenceFilter’ script. In total, we annotated 1,434 ncRNAs, mainly including 170 rRNAs, 74 miRNAs, 76 snRNAs, and 636 tRNAs (Table 4; Table S7). Genome features were visualized with TBTools-II v2.04229 in combination with the annotation (Fig. 3).
Protein-coding genes were annotated using MAKER v3.01.0430 by integrating three strategies: ab initio, transcriptome-based, and homology-based prediction. BRAKER v3.0.631 and GeMoMa v1.932 were used to integrate transcriptome and protein evidence, and their prediction results were combined as the ab initio input for MAKER. Transcriptome alignment BAM files were generated using HISAT2 v2.2.133. BRAKER automatically trained Augustus v3.4.034 and GeneMark-ETP35, and combined transcriptome data with arthropod homologous protein sequences from the OrthoDB11 database36 to improve prediction accuracy. Homology-based prediction was performed using GeMoMa based on the gene annotations of Drosophila melanogaster (Diptera), M. sexta (Lepidoptera), Amyelois transitella (Lepidoptera), B. mori (Lepidoptera), and Spodoptera frugiperda (Lepidoptera) from GenBank (Table 5). For transcriptome-based prediction, the transcriptome was assembled from the HISAT2 BAM files using StringTie v2.2.137.
In the end, we predicted 14,214 protein-coding genes in the CBT genome using MAKER, with an average gene length of 16,966.9 bp. The average numbers of exons, introns, and CDS per gene were 7.7, 6.7, and 7.4, respectively (Table 4). The average lengths of exons, introns, and CDS were 314.6 bp, 2,347.9 bp, and 222.5 bp, respectively (Table 4). Moreover, the BUSCO completeness of the predicted protein-coding gene sequences was 98.90%, including 77.47% single-copy, 21.43% duplicated, 0.07% fragmented, and 1.02% missing BUSCOs.
Functional annotation of the genes was performed using Diamond v2.1.7.16138 (--very-sensitive -e 1e-5) by searching against the UniProtKB v202305 database. For further functional annotation, InterProScan 5.65-97.039 was used to search databases including Pfam40, SMART41, Superfamily42, and CDD43. The eggNOG v5.0.244 database (http://eggnog6.embl.de) was searched with eggNOG-mapper v2.1.1245. After integrating these results, we found that 13,889 (97.71%) genes were functionally annotated against the UniProtKB database. InterPro identified structural domains for 11,694 protein-coding genes. InterPro and eggNOG-mapper jointly annotated GO terms for 10,190 genes and KEGG pathways for 4,863 genes (Table 4).
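Summary statistics such as the mean exon count and mean exon length are typically derived from the annotation file. The Python sketch below shows one way to compute them from GFF3 text, assuming standard nine-column GFF3 with `exon` features that carry a single `Parent=` attribute. It counts exons per transcript rather than per gene, which is a simplification, and it is not the procedure the authors used. The `demo_gff` records are invented.

```python
from collections import defaultdict


def exon_stats(gff3_text):
    """Return (mean exons per transcript, mean exon length in bp)."""
    exon_lengths = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        start, end = int(cols[3]), int(cols[4])
        # Parse the attribute column into key/value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parent = attrs.get("Parent")
        if parent is None:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        exon_lengths[parent].append(end - start + 1)
    if not exon_lengths:
        return 0.0, 0.0
    all_lengths = [n for lengths in exon_lengths.values() for n in lengths]
    return len(all_lengths) / len(exon_lengths), sum(all_lengths) / len(all_lengths)


# Invented toy annotation: two transcripts with 2 and 1 exons.
demo_rows = [
    ("chr1", "toy", "exon", "1", "100", ".", "+", ".", "ID=e1;Parent=tx1"),
    ("chr1", "toy", "exon", "201", "300", ".", "+", ".", "ID=e2;Parent=tx1"),
    ("chr1", "toy", "exon", "1", "50", ".", "-", ".", "ID=e3;Parent=tx2"),
]
demo_gff = "##gff-version 3\n" + "\n".join("\t".join(row) for row in demo_rows)

mean_exons, mean_exon_len = exon_stats(demo_gff)
print(mean_exons, mean_exon_len)
```

For a spliced transcript the number of introns is one less than the number of exons, which matches the reported means of 7.7 exons and 6.7 introns.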
Chromosomal synteny analysis
To explore interspecific chromosomal relationships, chromosomal synteny analysis was conducted between CBT and both B. mori (Lepidoptera) and M. sexta (Lepidoptera) (Table 5). Protein sequences were aligned using Diamond with the parameters “--ultra-sensitive --iterate -e 1e-5 -k 5”. Chromosomal synteny was then analyzed using MCScanX46 with the parameters “-s 5 -e 1e-5”. The results showed strong synteny between the chromosomes of CBT and M. sexta (Fig. 4). A chromosomal fission or fusion event occurred between M. sexta Chr28 and CBT Chr15 + Chr29. Synteny between the chromosomes of CBT and B. mori was also strong but slightly weaker than that between CBT and M. sexta, and chromosomal fusion or fission events were more frequent. Moreover, the autosomes and the Z sex chromosome were distinguished by chromosomal synteny, based on the relatively conserved nature of the Z chromosome in Lepidoptera47. Chromosome 1 was identified as the Z chromosome because it shares high synteny with the Z chromosomes of B. mori and M. sexta (Fig. 4).
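The Diamond options “-e 1e-5 -k 5” restrict the output to hits with an E-value of at most 1e-5 and report at most five target sequences per query. The Python sketch below reproduces that kind of filter on tabular alignment output, assuming the common 12-column BLAST tabular layout with the E-value in column 11 and the bit score in column 12. It only illustrates the effect of the thresholds and is not how Diamond applies them internally. The protein identifiers are invented.

```python
from collections import defaultdict


def filter_hits(lines, max_evalue=1e-5, top_k=5):
    """Keep the top_k best-scoring subjects per query below an E-value cutoff."""
    per_query = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip malformed or header lines
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue <= max_evalue:
            per_query[query].append((bitscore, subject))
    result = {}
    for query, hits in per_query.items():
        # Highest bit score first; sort is stable for ties.
        hits.sort(key=lambda hit: hit[0], reverse=True)
        result[query] = [subject for _, subject in hits[:top_k]]
    return result


def make_line(query, subject, evalue, bitscore):
    """Build one 12-column tabular line with placeholder alignment fields."""
    filler = ["90.0", "100", "10", "0", "1", "100", "1", "100"]
    return "\t".join([query, subject, *filler, str(evalue), str(bitscore)])


# Invented hits: one query with seven subjects, one of which fails the cutoff.
demo_lines = [make_line("cbt_g1", f"ms_p{i}", 1e-50, 100 + i) for i in range(6)]
demo_lines.append(make_line("cbt_g1", "ms_weak", 1e-3, 500))

kept = filter_hits(demo_lines)
print(kept)
```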
Data Records
The Hi-C, PacBio HiFi, ONT RNA-seq, RNA-seq, and WGS data for the CBT genome are available from NCBI under the accession numbers SRR27748981‒SRR2774898548 and the BioProject accession number PRJNA106022249. The assembled genome has been deposited in NCBI Assembly under the accession number GCA_036417725.150. Additionally, the annotation results of the CBT genome have been deposited in Figshare51.
Technical Validation
Three methods were used to assess the quality of the CBT genome assembly. Firstly, the purity of the genomic DNA was verified using a NanoDrop 2000 spectrophotometer and Qubit fluorometric quantitation, and its integrity was checked via pulsed-field gel electrophoresis and agarose gel electrophoresis. The A260/A280 absorbance ratio was approximately 1.89. Secondly, we used compleasm v0.2.4 with the insecta_odb10 database (n = 1,367 orthologues) as a reference to assess the completeness of the genome assembly. The BUSCO completeness was 99.49%, including 99.34% single-copy BUSCOs and 0.15% duplicated BUSCOs, with 0.15% fragmented BUSCOs and 0.37% missing BUSCOs (Table 2). The predicted protein-coding gene sequences were also evaluated for BUSCO completeness, resulting in C: 98.90% [S:77.47%, D:21.43%], F:0.07%, M:1.02%. Thirdly, reads were aligned back to the assembly using Minimap2, and the mapping rates for WGS, RNA-seq, RNA-ONT, and HiFi data were all over 90% (Table 2).
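In the BUSCO shorthand used above, C (complete) is the sum of S (single-copy) and D (duplicated), and C, F (fragmented), and M (missing) together should total 100% up to rounding. The Python sketch below parses a summary string in that notation and checks both relationships. The parser and its 0.02 percentage-point tolerance are illustrative assumptions, not part of compleasm or BUSCO.

```python
import re


def parse_busco(summary):
    """Extract the C/S/D/F/M percentages from a BUSCO-style summary string."""
    pairs = re.findall(r"([CSDFM])\s*:\s*([\d.]+)%", summary)
    return {key: float(value) for key, value in pairs}


def is_consistent(values, tol=0.02):
    """Check C = S + D and C + F + M = 100 within a rounding tolerance."""
    complete_ok = abs(values["S"] + values["D"] - values["C"]) <= tol
    total_ok = abs(values["C"] + values["F"] + values["M"] - 100.0) <= tol
    return complete_ok and total_ok


# Gene-set summary as written in the text.
gene_set = parse_busco("C: 98.90% [S:77.47%, D:21.43%], F:0.07%, M:1.02%")

# Genome-level values from the text, written in the same notation.
genome = parse_busco("C:99.49% [S:99.34%, D:0.15%], F:0.15%, M:0.37%")

print(gene_set, is_consistent(gene_set))
print(genome, is_consistent(genome))
```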
Code availability
No custom scripts were used in the present study. All commands and pipelines used for data processing in this work were run according to the manuals and protocols of the relevant bioinformatic software. All commands used in this work are available in Figshare52.
References
Pittaway, A. R. & Kitching, I. J. Notes on selected species of hawkmoths (Lepidoptera: Sphingidae) from China, Mongolia and the Korean Peninsula. Tinea 16, 170–211 (2000).
Liu, X. F. et al. Evaluation of rearing factors affecting Clanis bilineata tsingtauica Mell larvae fed by susceptible soybean variety NN89-29 in spring and autumn sowing. Insects 14, 32 (2023).
Tian, H. Harm and comprehensive control of Clanis bilineata tsingtauica Mell. J. Nanyang Norm. Univ. 8, 58–60 (2009).
Gao, Y., Zhao, Y. J., Xu, M. L. & Shi, S. S. Clanis bilineata tsingtauica: a sustainable edible insect resource. Sustainability. 13, 12533 (2021).
Gao, Y., Zhao, Y. J., Xu, M. L. & Shi, S. S. Soybean hawkmoth (Clanis bilineata tsingtauica) as food ingredients: a review. CyTA - J. Food. 19, 341–348 (2021).
Su, Y. et al. Nutritional properties of larval epidermis and meat of the edible insect Clanis bilineata tsingtauica (Lepidoptera: Sphingidae). Foods. 10, 2895 (2021).
Mao, Y. M. & Wang, K. L. Modulation of the growth performance, body composition and nonspecific immunity of white shrimps (Penaeus vannamei) upon dietary Clanis bilineata larvae. Aquac. Rep. 24, 101108 (2022).
Guo, M. M. et al. Diapause termination and post-diapause of overwintering Clanis bilineata tsingtauica larvae. Chin. J. Appl. Entomol. 58, 966–972 (2021).
Stöckl, A. L. & Kelber, A. Fuelling on the wing: sensory ecology of hawkmoth foraging. J. Comp. Physiol. A. 205, 399–413 (2019).
Godfrey, R. K., Britton, S. E., Mishra, S., Goldberg, J. K. & Kawahara, A. Y. A high-quality, long-read genome assembly of the whitelined sphinx moth (Lepidoptera: Sphingidae: Hyles lineata) shows highly conserved melanin synthesis pathway genes. G3. 13, jkad090 (2023).
Hundsdoerfer, A. K. et al. High-quality haploid genomes corroborate 29 chromosomes and highly conserved synteny of genes in Hyles hawkmoths (Lepidoptera: Sphingidae). BMC Genomics. 24, 443 (2023).
Boyes, D. & Holland, P. W. H. The genome sequence of the lime hawk-moth, Mimas tiliae (Linnaeus, 1758). Wellcome Open Res. 6, 357 (2021).
Boyes, D. The genome sequence of the small elephant hawk moth, Deilephila porcellus (Linnaeus, 1758). Wellcome Open Res. 7, 80 (2022).
Gershman, A. et al. De novo genome assembly of the tobacco hornworm moth (Manduca sexta). G3. 11, jkaa047 (2021).
Cheng, H. Y., Concepcion, G. T., Feng, X. W. & Zhang, H. W. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 18, 170–175 (2021).
Zhou, C. X., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 39, btac808 (2023).
Zhang, H. et al. Fast alignment and preprocessing of chromatin profiles with Chromap. Nat. Commun. 12, 6566 (2021).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Huang, N. & Li, H. Compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics. 39, btad595 (2023).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Meng, G. L., Li, Y. Y., Yang, C. T. & Liu, S. L. MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization. Nucleic Acids Res. 47, e63 (2019).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. PNAS. 117, 9451–9457 (2020).
Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA. 12, 2 (2021).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 6, 11 (2015).
Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0., Available online: http://www.repeatmasker.org (2013).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 29, 2933–2935 (2013).
Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol Biol. 1962, 1–14 (2019).
Chen, C. et al. TBtools-II: A “one for all, all for one” bioinformatics platform for biological big-data mining. Mol. Plant. 16, 1733–1742 (2023).
Holt, C. & Yandell, M. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 12, 491 (2011).
Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: Automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 3, lqaa108 (2021).
Keilwagen, J., Hartung, F., Paulini, M., Twardziok, S. O. & Grau, J. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics. 19, 189 (2018).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Bruna, T., Lomsadze, A. & Borodovsky, M. GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. Genome Res 34, 757–768 (2024).
Kuznetsov, D. et al. OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Res. 51, D445–D451 (2023).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Buchfink, B., Reuter, K. & Drost, H. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods. 18, 366–368 (2021).
Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2023).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
Letunic, I., Khedkar, S. & Bork, P. SMART: recent updates, new developments and status in 2020. Nucleic Acids Res. 49, D458–D460 (2021).
Wilson, D. et al. SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386 (2009).
Wang, J. et al. The conserved domain database in 2023. Nucleic Acids Res. 51, D384–D388 (2023).
Huerta-Cepas, J. et al. EggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Cantalapiedra, C. P. et al. EggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Bio. Evol. 38, 5825–5829 (2021).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).
Fraïsse, C., Picard, M. A. L. & Vicoso, B. The deep conservation of the Lepidoptera Z chromosome suggests a non-canonical origin of the W. Nat. Commun. 8, 1486 (2017).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP486259 (2024).
NCBI BioProject https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1060222 (2024).
Xing, G. Genbank https://identifiers.org/ncbi/insdc.gca:GCA_036417725.1 (2024).
Xing, G. Chromosome-level genome assembly and annotations of Clanis bilineata tsingtauica Mell (Lepidoptera: Sphingidae). figshare https://doi.org/10.6084/m9.figshare.25151900.v1 (2024).
Xing, G. All commands used for chromosome-level genome assembly and annotations of Clanis bilineata tsingtauica Mell (Lepidoptera: Sphingidae). figshare https://doi.org/10.6084/m9.figshare.26396881.v1 (2024).
Acknowledgements
This research was supported by the National Key R&D Program of China (2021YFD1201604), Natural Science Foundation of China (31571694), Jiangsu Postgraduate Practice and Innovation Program (SJCX23_0210, SJCX21_0225), Key Research Topics for Higher Education Reform in Jiangsu Province in 2023 (2023JSJG154), MOA CARS-04 Program, and Jiangsu JCICMCP Program.
Author information
Contributions
G.X. and J.G. conceived the research project. Y.Y., L.Y. and N.L. accomplished the collection of samples. Y.Y., K.Z., Y.X. performed the bioinformatic analyses. Y.Y., K.Z. and G.X. wrote the manuscript. G.X. and J.G. revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yan, Y., Zhao, K., Yang, L. et al. Chromosome-level genome assembly and annotation of Clanis bilineata tsingtauica Mell (Lepidoptera: Sphingidae). Sci Data 11, 1062 (2024). https://doi.org/10.1038/s41597-024-03853-5
DOI: https://doi.org/10.1038/s41597-024-03853-5