Background & Summary

Tortricidae, the leafroller moths, is one of the largest families of Lepidoptera (butterflies and moths)1 and includes numerous notorious economic pests such as the spruce budworm, Choristoneura fumiferana2, the oriental fruit moth, Grapholita molesta3, and the codling moth, Cydia pomonella4. The two main subfamilies, Tortricinae and Olethreutinae, are relatively young5 and comprise over 95% of tortricid species. Genomes of many species in these two subfamilies have been determined6, revealing an ancestral sex chromosome-autosome fusion and two subsequent autosome fusions relative to the ancestral karyotype of Lepidoptera7. Compared to these two successful subfamilies, the relict subfamily Chlidanotinae is much more limited in distribution range, host range, species richness, and population size. Species of this subfamily are mainly distributed in tropical regions, suggesting that their climatic adaptability differs from that of species in the other subfamilies. Thus, this group can provide valuable insights into the phylogeny, pest adaptation, and evolution of Tortricidae. However, no genome has been assembled for any species of Chlidanotinae.

Here, we present the first chromosome-level genome assembly and annotation in the Chlidanotinae, generated from Polylopha cassiicola8 using high-coverage long-read and Hi-C sequencing. This species is mainly distributed in the southern coastal regions of China and in Southeast Asia. It is a pest of the trees Cinnamomum cassia and C. camphora. We also assembled the mitochondrial genome of this species from Illumina short reads. These genomes are expected to provide information for understanding the phylogeny, karyotype evolution, and adaptive evolution of Tortricidae.

Methods

Sample collection and sequencing

P. cassiicola larvae were collected from the tops of C. camphora in Guangxi, China. The larvae were reared in the laboratory to pupae and adults for genomic and transcriptome sequencing. Three individuals were used for three types of genome sequencing: one male pupa for Nanopore long-read sequencing, one male pupa for Illumina short-read sequencing, and one female adult for Hi-C sequencing. In addition, four larvae were used for RNA sequencing. Nucleic acid extraction and sequencing library construction were contracted to BerryGenomic (Beijing, China). Methods for nucleic acid extraction, platforms for sequencing, and sequencing outputs are provided in Table 1.

Table 1 Methods and outputs for sequencing experiments.
Table 2 Statistics of repeat elements and non-coding RNAs in Polylopha cassiicola genome.

Genome assembly

The Nanopore long reads were assembled into 76 contigs using NextDenovo 2.5.2 (https://github.com/Nextomics/NextDenovo) with parameters: “read_cutoff = 4k, genome_size = 400 m, nextgraph_options = -a 1”. Redundant sequences in the contigs were removed using Purge_dups9. The 65 cleaned contigs were then assembled to chromosome level using Hi-C information. In this analysis, we mapped the Hi-C reads to the cleaned contigs using BWA10 with options: “mem -SP5”, anchored the contigs using YaHS 1.2a.111 with option: “-e GATC”, and manually adjusted the scaffolds using Juicebox 1.22.0112. We removed contigs that had no contact information with the chromosomes, as these could derive from contamination such as symbiotic microbes. Finally, the chromosome-level genomic sequences were subjected to two rounds of long-read polishing and two rounds of short-read polishing using NextPolish 1.4.113. The resulting P. cassiicola genome is 302.03 Mb in size and contains 21 autosomes and one Z sex chromosome (Fig. 1a).
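Contig or scaffold counts and N50 values are the usual way to compare the assembly stages described above. The following is a minimal sketch of how such statistics can be computed from FASTA-formatted lines; the function names and the toy records are illustrative assumptions, not the scripts used in this study.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length of the sequence at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy records for illustration only (not real data).
toy = [">chr1", "ACGTACGTAC", ">chr2", "ACGTAC", ">ctg3", "ACGT"]
toy_lengths = read_fasta_lengths(toy)
print(len(toy_lengths), sum(toy_lengths.values()), n50(list(toy_lengths.values())))
```

For a real assembly, the same functions can be applied to an open file handle, for example `read_fasta_lengths(open("assembly.fa"))`, where the file name is hypothetical.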

Fig. 1

Genomic features of the nuclear genome of Polylopha cassiicola. (a) Hi-C contact matrix of 22 putative chromosomes. (b) Synteny among four tortricid species from three subfamilies and an outgroup. The labels at the bottom mark the ancestral linkage groups of Lepidoptera6.

We also assembled the mitochondrial genome using MitoZ 3.614 based on the short reads. In the mitochondrial genome, we identified 13 protein-coding genes, 22 tRNAs, and 1 rRNA (Fig. 2).

Fig. 2

Distribution of annotated genes on mitochondrial genome. The inner ring shows the relative read coverage.

Genome synteny

We analysed the chromosomal synteny between P. cassiicola and three other species from Tortricidae and one from Sesiidae: Choristoneura fumiferana (Tortricidae: Tortricinae)2, Grapholita molesta (Tortricidae: Olethreutinae)3, Tortricodes alternella (Tortricidae: Tortricinae; NCBI GenBank assembly: GCA_947859335.115), and Sesia bembeciformis (Sesiidae: Sesiinae)16. Synteny analysis was conducted using the MCScanX pipeline in the JCVI utility libraries17. We assigned the names of the ancestral linkage groups of Lepidoptera6 (Merian elements, M1–M31 and MZ) based on chromosomal homology. The results show different patterns of chromosomal fusion in T. alternella and P. cassiicola (Fig. 1b).
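Assigning Merian elements by chromosomal homology can be pictured as a vote over syntenic anchor genes: each chromosome inherits the element(s) that contribute a substantial share of its anchors, and a chromosome carrying more than one element is a candidate fusion. The sketch below illustrates this idea under stated assumptions; the function name, the 20% threshold, and the toy anchors are hypothetical and do not reproduce the procedure or results of this study.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def assign_elements(
    anchors: Iterable[Tuple[str, str]], min_share: float = 0.2
) -> Dict[str, List[str]]:
    """Map each chromosome to the Merian elements that supply at least
    `min_share` of its syntenic anchor genes.

    `anchors` holds (chromosome, merian_element) pairs, one per anchor gene.
    """
    per_chrom: Dict[str, Counter] = defaultdict(Counter)
    for chrom, element in anchors:
        per_chrom[chrom][element] += 1

    assignment: Dict[str, List[str]] = {}
    for chrom, counts in per_chrom.items():
        total = sum(counts.values())
        assignment[chrom] = sorted(
            element for element, n in counts.items() if n / total >= min_share
        )
    return assignment


# Toy anchors for illustration only (made-up chromosome names and counts).
toy_anchors = (
    [("chrA", "MZ")] * 6 + [("chrA", "M2")] * 4      # two elements: candidate fusion
    + [("chrB", "M1")] * 9 + [("chrB", "M5")] * 1    # M5 is below the threshold
)
for chrom, elements in assign_elements(toy_anchors).items():
    label = "candidate fusion" if len(elements) > 1 else "single element"
    print(chrom, elements, label)
```

The threshold guards against a few stray anchors, such as translocated single genes, being mistaken for a fusion.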

Repeat element and non-coding RNA annotation

Repeat elements were detected using RepeatMasker 4.1.518 with options “-no_is -norna -xsmall -q”. This analysis was conducted against three databases: Repbase (http://www.girinst.org), Dfam database1 specific to Arthropoda, and a species-specific repeat library constructed using RepeatModeler219. Transfer RNA (tRNA) was predicted by tRNAscan-SE 2.0.1220 with default parameters, and ribosomal RNA (rRNA) was predicted using Barrnap 0.9 (https://github.com/tseemann/barrnap). In the P. cassiicola genome, 36.82% of bases were annotated as repeat elements (Table 2). We identified 67 rRNAs and 1052 tRNAs (Table 2).
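Because the “-xsmall” option makes RepeatMasker return repeats in lower case (soft masking), the repeat proportion of a masked assembly can be cross-checked by counting lower-case bases. The following is a minimal sketch of that check; the function name and toy sequences are assumptions for illustration, and the 36.82% value reported here may instead have been taken from RepeatMasker's own summary output.

```python
from typing import Iterable


def softmasked_fraction(lines: Iterable[str]) -> float:
    """Fraction of bases written in lower case in FASTA-formatted lines."""
    masked = 0
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy soft-masked records for illustration only (not real data).
toy_masked = [">chr1", "ACGTacgtAC", ">chr2", "acGTAC"]
print(f"{100 * softmasked_fraction(toy_masked):.2f}% of bases are soft-masked")
```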

Table 3 Statistics of Polylopha cassiicola assemblies.

Gene prediction and functional annotation

Gene structure was predicted using an ab initio method, Helixer21, with options: “--subsequence-length 320760 --batch-size 6”, and with a pre-trained model for invertebrates, “invertebrate_v0.3_m_0200”. Gene function, Gene Ontology (GO) terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms for the predicted genes were annotated using the eggNOG-Mapper22 web tools against the eggNOG Database 5. A total of 15412 protein-coding genes were predicted, of which 12671 were functionally annotated.
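Gene totals such as those above can be checked directly against a GFF3 annotation file by tallying the feature types in its third column. The sketch below shows one way to do this and to express the annotated share as a percentage (12671 of 15412 is about 82%); the function names and the toy GFF3 records are illustrative assumptions rather than part of the published workflow.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff3_lines: Iterable[str]) -> Counter:
    """Tally GFF3 feature types (column 3), skipping comments and blank lines."""
    counts: Counter = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # a valid GFF3 record has nine tab-separated columns
            counts[fields[2]] += 1
    return counts


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


# Toy GFF3 records for illustration only (made-up coordinates and IDs).
toy_gff3 = [
    "##gff-version 3",
    "chr1\tHelixer\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tHelixer\tmRNA\t100\t900\t.\t+\t.\tID=g1.1;Parent=g1",
    "chr1\tHelixer\tCDS\t150\t800\t.\t+\t0\tID=g1.1.cds;Parent=g1.1",
    "chr2\tHelixer\tgene\t50\t400\t.\t-\t.\tID=g2",
]
counts = count_feature_types(toy_gff3)
print(counts["gene"], "genes in the toy file")
print(f"{percent(12671, 15412):.1f}% of predicted genes functionally annotated")
```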

Data Records

The Nanopore reads, Illumina reads, Hi-C reads, and RNA reads for the P. cassiicola genome assembly were deposited in the NCBI Sequence Read Archive under accession number SRP47975923. The nuclear and mitochondrial genome assemblies were deposited in GenBank under accession number GCA_038024825.124. The genome annotation files are available in Figshare25 at https://doi.org/10.6084/m9.figshare.24902046.

Technical Validation

To validate the accuracy of the final genome assembly, we mapped the Illumina short reads and Nanopore long reads to the P. cassiicola genome using Minimap226 with option “-ax sr” for short reads and option “-ax map-ont” for long reads. The mapping rates for the short reads and long reads were calculated using Samtools27. The analysis revealed mapping rates of 96.38% and 98.73% for the short and long reads, respectively. We also examined the coverage of short reads along the mitochondrial genome, which showed 100% coverage (Fig. 2).
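One common way to obtain a mapping rate from an alignment is to read the summary printed by `samtools flagstat`; which Samtools subcommand was used here is not stated, so this is an assumption. The sketch below parses the "in total" and "mapped" lines of such a summary. The function name and the toy summary are hypothetical, and the total counts alignment records, so it includes any secondary and supplementary alignments.

```python
import re
from typing import Iterable, Optional


def mapping_rate(flagstat_lines: Iterable[str]) -> Optional[float]:
    """Return the percentage of QC-passed records that are mapped,
    or None if the expected lines are missing."""
    total = None
    mapped = None
    for line in flagstat_lines:
        m = re.match(r"^(\d+) \+ \d+ in total", line)
        if m:
            total = int(m.group(1))
            continue
        # Matches "N + M mapped (" but not "N + M primary mapped (".
        m = re.match(r"^(\d+) \+ \d+ mapped \(", line)
        if m:
            mapped = int(m.group(1))
    if not total or mapped is None:
        return None
    return 100.0 * mapped / total


# Toy flagstat-style summary for illustration only (made-up counts).
toy_flagstat = [
    "1000 + 0 in total (QC-passed reads + QC-failed reads)",
    "1000 + 0 primary",
    "0 + 0 secondary",
    "0 + 0 supplementary",
    "964 + 0 mapped (96.40% : N/A)",
    "964 + 0 primary mapped (96.40% : N/A)",
]
print(f"{mapping_rate(toy_flagstat):.2f}%")
```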

Completeness of the assembly and gene prediction was evaluated using BUSCO 5.4.728 with the lepidoptera_odb10 database. In this analysis, BUSCO classified each of 5,286 single-copy orthologues of Lepidoptera in our genome assembly as single-copy (S), duplicated (D), fragmented (F), or missing (M). The analyses showed completeness of 95.1%–98.4% across assembly stages (Table 3) and 97.8% for the predicted gene set: “C: 97.8% [S: 97.2%, D: 0.6%], F: 0.9%, M: 1.3%”. The quality of gene prediction was manually evaluated using RNA-seq data. Specifically, RNA-seq reads were mapped to the genome using Hisat29 and Samtools27. We imported the resulting BAM file and the annotation file into the IGV browser30. Based on manual examination, the machine learning-based annotation method predicted near-complete gene structures. These results indicate that we have obtained a high-quality assembly and annotation for the P. cassiicola genome.
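BUSCO summary strings follow a fixed pattern in which complete (C) is the sum of single-copy (S) and duplicated (D), and C, F, and M together account for all searched orthologues. The following minimal sketch parses the summary quoted above and checks that arithmetic within a rounding tolerance; the function names and the 0.15 percentage-point tolerance are assumptions for illustration.

```python
import re
from typing import Dict


def parse_busco(summary: str) -> Dict[str, float]:
    """Extract the C, S, D, F and M percentages from a BUSCO summary string."""
    return {
        key: float(value)
        for key, value in re.findall(r"([CSDFM]):\s*([\d.]+)%", summary)
    }


def is_consistent(scores: Dict[str, float], tol: float = 0.15) -> bool:
    """Check S + D = C and C + F + M = 100, allowing for rounding error."""
    return (
        abs(scores["S"] + scores["D"] - scores["C"]) <= tol
        and abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tol
    )


# Summary string for the predicted gene set, as quoted in the text.
summary = "C: 97.8% [S: 97.2%, D: 0.6%], F: 0.9%, M: 1.3%"
scores = parse_busco(summary)
print(scores, is_consistent(scores))
```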