Background & Summary

Thrips are small insects from the order Thysanoptera. Among the currently described thrips, only about 150 species are recognized as pests1. The flower thrips Frankliniella intonsa is a common species found in the flowers of many plants. It is native to Eurasia but has now been introduced to Oceania and North America2,3,4,5,6. Although its small body size allows for easy dispersal, the distribution of F. intonsa remains limited compared with that of a cosmopolitan pest from the same genus, the western flower thrips, Frankliniella occidentalis7,8,9. In its native range, F. intonsa has at times been reported as a pest10 but is often found alongside other thrips in the field, leading to species competition and displacement11,12,13,14,15. However, in recent years, F. intonsa has been more frequently treated as a pest of crops13,16. In some regions, F. intonsa has developed resistance to insecticides used for its control17,18. In addition, F. intonsa has been identified as a vector of plant viruses from the genus Tospovirus19,20,21, although its transmission efficiency is lower than that of F. occidentalis11. Therefore, we need to understand its biology, ecology, and evolution, as well as its competition with other species, to reassess the pest status of F. intonsa and develop a proper control strategy22,23.

Well-assembled genomes will provide genetic resources for the study of F. intonsa. Currently, genomes of thrips have been reported for the western flower thrips Frankliniella occidentalis24, tobacco thrips Frankliniella fusca25, melon thrips Thrips palmi26, bean flower thrips Megalurothrips usitatus27,28 and rice thrips Stenchaetothrips biformis29. Recently, a parallel study of ours published a genome for F. intonsa that represents the first chromosome-scale genome for a species of the genus Frankliniella30. The specimens used for that F. intonsa genome were collected from Zhejiang Province in southern China30. Here, we assembled another chromosome-level genome for F. intonsa, sequenced from specimens collected in Inner Mongolia, northern China, to enrich the genetic resources of this species. We used Illumina short-read sequences to estimate the genome features of F. intonsa and Oxford Nanopore Technologies (ONT) long-read sequences to assemble a contig-level genome. We then used chromosome conformation capture (Hi-C) technology to assemble these contigs into a chromosome-level genome.

Methods

Sample collection and genomic DNA sequencing

The strain used for genome sequencing was reared for 10 generations in the laboratory at the College of Forestry, Inner Mongolia Agricultural University, Hohhot, China. About 100 unsexed adults collected from Huanghuagou Scenic Area in Chaha’er Right Wing Central Banner, Inner Mongolia, China (E 112°32′03″, N 41°08′17″) were used to establish the strain. Frankliniella intonsa was reared on seedlings of horsebean (Vicia faba) under the following laboratory conditions: 25 °C, 60% relative humidity and a 16 L:8 D photoperiod. The specimens used for sequencing were morphologically identified to avoid the inclusion of other thrips species. About 1,000 pooled male and female adults were used for the extraction of high-molecular-weight DNA (HMW DNA) and subsequent library construction. Genomic DNA was extracted from the entire bodies of the pooled individuals using the Qiagen MagAttract HMW DNA Mini Kit, following the manufacturer’s protocol. A short-read DNA library with an insert size of 500 bp was constructed using the Illumina TruSeq DNA PCR-Free HT LPK and sequenced on the Illumina X Ten platform (Illumina Inc., San Diego, CA, USA). A long-read DNA library with an insert size of 23 kb was prepared according to the manufacturer’s protocol and sequenced using the PromethION model of the ONT platform. The short reads were used for the genome survey analysis (estimating the genome size and the rates of heterozygosity and duplication) and for correcting the long-read assembly, while the long reads were used for the contig-level genome assembly. Sequencing generated 15.55 Gb (73.88X coverage) of clean short-read data and 28.35 Gb (135.65X coverage) of long-read data (Table 1).
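The relation between data yield and coverage reported above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, 1 Gb is taken as 1,000 Mb, and the genome size used to compute the reported coverage is not stated in the text, so the code simply inverts depth = yield / genome size to show which genome size each reported pair implies.

```python
def coverage(yield_gb: float, genome_size_mb: float) -> float:
    """Sequencing depth (X) for a data yield in Gb and a genome size in Mb."""
    return yield_gb * 1000.0 / genome_size_mb


def implied_genome_size_mb(yield_gb: float, depth_x: float) -> float:
    """Genome size (Mb) implied by a data yield in Gb and a reported depth."""
    return yield_gb * 1000.0 / depth_x


# Yields and depths as reported in the text for the short and long reads.
short_read_size = implied_genome_size_mb(15.55, 73.88)
long_read_size = implied_genome_size_mb(28.35, 135.65)

print(f"short reads imply ~{short_read_size:.1f} Mb")
print(f"long reads imply ~{long_read_size:.1f} Mb")
```

Both pairs imply a reference size of roughly 209 to 210 Mb, close to the total length of the anchored assembly reported later in the text.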

Table 1 Sequencing data generated in this study for genome assembly of Frankliniella intonsa.

Hi-C library construction and sequencing

The chromosome conformation of the genome was captured to determine the order and orientation of the contigs. Approximately 1,000 adults of mixed sex were used for constructing the Hi-C library. The specimens were ground and then cross-linked in a fresh, ice-cold nuclear isolation buffer with a 2% formaldehyde solution for 10 minutes at room temperature. The fixed cells were digested with the restriction enzyme DpnII (NEB) and processed according to the standard operating procedure for Hi-C library construction, which included cell lysis, incubation, labelling of the DNA ends with biotin-14-dCTP, and blunt-end ligation of the cross-linked fragments. The Hi-C library was amplified by 12–14 PCR cycles and sequenced on the Illumina NovaSeq 6000 platform. A total of 26.97 Gb of clean data were generated, representing 120.05X coverage of the genome (Table 1).

Genome characteristics estimation

Genome characteristics were estimated from the Illumina short reads. The raw sequences were trimmed using fastp31 under the default parameters. Based on the trimmed data, KMC version 3.032 was used to compute the K-mer distribution histograms for 17-, 21-, 27-, 31- and 41-mers with parameters ‘-m96 -ci1 -cs10000’ and ‘-cx10000’. The genome size, heterozygosity rate, and duplication rate were estimated using GCE version 2.0 under the default parameters33. The estimated genome size decreased as the K-mer length increased, ranging from 230 Mb to 255 Mb, similar to the estimate from a previous study of this species30. The estimated genome duplication rate also decreased as the K-mer length increased, with values ranging from 2.71% to 3.22%, higher than that reported in the previous study of this species (2.04%)30. Each K-mer distribution showed double peaks, indicating a highly complex genome (Table 2, Fig. 1).

Table 2 Statistics for chromosomal-level assembly and annotation of Frankliniella intonsa genome.
Fig. 1

Estimated characteristics of Frankliniella intonsa genome based on Illumina short-read data. Results were obtained in GenomeScope version 2.0 with 17- (A), 21- (B), 27- (C), 31- (D) and 41- (E) mer. The K-mer distributions showed double peaks: the first peak indicates genome duplication and the highest peak represents a genome size peak. len, estimated genome size in bp; aa, homozygosity rate; ab, heterozygosity rate; dup, duplication rate.
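The idea behind a K-mer-based genome size estimate can be illustrated with a minimal sketch. This is not the model used by GCE or GenomeScope, which fit the full distribution and account for heterozygosity and repeats; it is only the basic relation genome size ≈ total K-mers / depth of the main peak. The function name, the `min_depth` error cutoff, and the toy histogram are hypothetical.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5):
    """Naive K-mer genome size estimate.

    histogram maps K-mer depth -> number of distinct K-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (min_depth is an illustrative cutoff, not a value from the study).
    Returns (peak_depth, estimated_size_in_kmers).
    """
    kept = {d: c for d, c in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no K-mers at or above min_depth")
    # The main (highest) peak approximates the depth of single-copy sequence.
    peak_depth = max(kept, key=kept.get)
    # Total K-mer occurrences divided by the per-copy depth.
    total_kmers = sum(d * c for d, c in kept.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (made-up numbers): an error tail at depths 1-2 and a
# main peak at depth 20 with shoulders at depths 10 and 30.
toy = {1: 1_000_000, 2: 200_000, 10: 50, 20: 100, 30: 50}
peak, size = estimate_genome_size(toy)
print(peak, size)
```

With a real KMC histogram, the dictionary would be filled from the two-column depth/count output, and a double-peaked distribution such as the one in Fig. 1 would need the more careful model-based treatment.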

Genome assembly and annotation

The ONT long reads were quality-controlled and assembled into contigs using a “correct-then-assemble” strategy in nextDenovo version 2.5.234 with parameters ‘read_cutoff = 1k, genome_size = 400m, pa_correction = 20, sort_options = -m 100g -t 10, minimap2_options_raw = -t 10, correction_options = -p 15, minimap2_options_cns = -t 10, nextgraph_options = -a 1’. These contigs were then polished three times with the Illumina short reads using pilon version 1.2235 under the default parameters. The polished contigs were further assembled into a chromosome-level genome using the Hi-C sequencing data. Low-quality reads and adapters from the Hi-C library were filtered using Trimmomatic version 0.3936 under the default parameters, and the remaining reads were mapped to the assembled contigs using Juicer37 with default parameters. The contigs were grouped into chromosomes using 3D de novo assembly (3D-DNA) version 180922 with parameters ‘--editor_repeat_coverage = 15, -r 2’38. Misassemblies were manually corrected in Juicebox version 2.16.00 (https://github.com/aidenlab/Juicebox), and the raw chromosomes were updated by rerunning 3D-DNA with the script “run-asm-pipeline-post-review.sh”. Finally, the repeat-masked high-quality genome assembly was submitted to the online tool Helixer39 under the invertebrate mode for gene structure annotation. Functional annotation was performed by searching the predicted proteins against the UniProt/Swiss-Prot database using blastp version 2.12.0+40 with the following parameters: ‘-evalue 0.000001 -outfmt 6 -num_threads 128 -num_alignments 1 -seg yes -soft_masking true -lcase_masking -max_hsps 1’.

In total, 422,839 contigs were assembled into 15 chromosomes (Fig. 2). The largest chromosome was 21.406 Mb and the smallest was 10.106 Mb. We numbered the chromosomes in descending order of size. The total length of the anchored genome was 209.09 Mb with an N50 of 13.415 Mb. About 57 Mb of contigs were not anchored to any chromosome. The anchored genome is smaller than both the estimated genome size and the previously assembled genome for this species30. Both anchored and unanchored contigs were submitted to GenBank with accession numbers CM069028.1–CM069042.1. In total, 14,109 protein-coding genes (PCGs) were identified, of which 9,931 have functional annotation41. The G + C content of the final genome assembly was 51.75% (Table 2).
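For readers unfamiliar with the assembly statistics quoted above, the following sketch shows how an N50 and a G + C fraction are conventionally computed. It is a generic illustration, not the authors' pipeline; the function names and the toy inputs are hypothetical, since the individual chromosome lengths are not listed in the text.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable for non-empty input")


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T); N is ignored."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Toy inputs (made-up lengths and sequence), for illustration only.
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))          # half of 150 is 75; 50 + 40 = 90 >= 75
print(gc_fraction("ATGCNNgc"))   # 4 of 6 unambiguous bases are G or C
```

Applied to the 15 chromosome lengths of the anchored assembly, `n50` would return the reported chromosome-level N50.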

Fig. 2

Genome-wide contact matrix of Frankliniella intonsa generated using Hi-C data. Each blue square represents a chromosome and each green square represents a contig. Fifteen chromosomes were anchored under the default parameters of Juicer and 3D-DNA software. Numbers on the top and left axes show the chromosome length in Mb; numbers on the bottom axis show the chromosome number. Chromosomes are numbered based on their size, from the largest to the smallest.

Repeat elements and non-coding RNA predictions

Repetitive elements longer than 1,000 bp were identified by searching against the Insecta repeats in RepBase Update (20120418). The identification was performed using RepeatMasker version open-4.0.042 (-no_is -norna -xsmall -q) with the search engine RM-BLAST (v2.2.23+). De novo identification of transposable elements (TEs) was performed using RepeatModeler43. Non-coding RNAs were identified using Rfam44,45, while transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) were searched using tRNAscan-SE v2.046 and RNAmmer v1.247, respectively, both under default parameters. A total of 393,270 repeat elements were identified, including 3,903 retroelements with a total length of 452,458 bp, 8,858 DNA transposons and 380,509 tandem repeats (TRs) (Table 3). We identified 48 miRNAs, 87 snRNAs, 30 snoRNAs, 143 rRNAs and 183 tRNAs in the F. intonsa genome (Table 4).

Table 3 Repeated elements identified in the Frankliniella intonsa genome.
Table 4 Non-coding RNA identified in the Frankliniella intonsa genome.

Data Records

The genome project was deposited at NCBI under BioProject No. PRJNA1016113. Genomic Illumina sequencing data were deposited in the Sequence Read Archive at NCBI under accession SRR2610549448. Genomic ONT sequencing data were deposited in the Sequence Read Archive at NCBI under accession SRP46158349. The Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession SRR2612292850. The genome assembly, genome annotation, and protein coding genes files were deposited in Figshare under a DOI of https://doi.org/10.6084/m9.figshare.24174591.v541. The final genome assembly was also deposited in GenBank at NCBI under the accession number GCA_035584235.151.

Technical Validation

The extracted high-molecular-weight (HMW) DNA had an average size of approximately 23 kb, as determined by pulsed-field gel electrophoresis. To assess the integrity and quality of the genome assembly and the set of protein-coding genes, Benchmarking Universal Single-Copy Orthologs (BUSCO) version 5.4.552 was used. For the chromosome-level genome assembly, the BUSCO completeness was 93.3%, 95.6%, 96.1% and 95.0% based on the Eukaryota, Metazoa, Arthropoda and Insecta (odb_10, released on 2024-01-08) datasets, respectively, while the previously assembled genome has a completeness of 96.9%–98.8%30. For the protein-coding gene set, the BUSCO completeness was 93.0%, 94.6%, 96.3% and 95.2% based on the Eukaryota, Metazoa, Arthropoda and Insecta datasets, respectively, while the previously assembled genome has a completeness of 89.5%–94.4%30. We mapped our Illumina short reads to the assembled genomes using BWA version 0.7.17-r1198-dirty53 with the BWA-MEM algorithm. The mapping rates of our short-read data to our unmasked chromosome-level genome and to that of Zhang et al.30 were 81.92% and 87.30%, respectively. We also mapped the Illumina short reads of Zhang et al.30 and obtained a mapping rate of 84.04% for our genome assembly and 92.80% for the assembly of Zhang et al.30.
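BUSCO completeness values such as those above are usually read from the one-line summary that BUSCO prints, in which completeness (C) is the sum of single-copy (S) and duplicated (D) orthologs. The sketch below parses such a line and checks its internal consistency. It assumes the BUSCO v5 short-summary notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function name is hypothetical and the example line contains made-up values, not results from this study.

```python
import re

# Assumed BUSCO v5 one-line summary notation (an assumption, see lead-in).
BUSCO_LINE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)


def parse_busco_summary(line: str) -> dict[str, float]:
    """Extract C, S, D, F, M percentages and the BUSCO count n from a summary line."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    c, s, d, f, m = (float(x) for x in match.groups()[:5])
    return {"C": c, "S": s, "D": d, "F": f, "M": m, "n": int(match.group(6))}


# Hypothetical summary line for illustration only.
example = "C:95.0%[S:93.0%,D:2.0%],F:2.0%,M:3.0%,n:1367"
scores = parse_busco_summary(example)

# Consistency checks: C = S + D, and C + F + M = 100 (within rounding).
assert abs(scores["C"] - (scores["S"] + scores["D"])) < 0.15
assert abs(scores["C"] + scores["F"] + scores["M"] - 100.0) < 0.15
print(scores)
```

Comparing the parsed C, D and M values across lineage datasets makes it easier to see whether a lower completeness comes from missing or from fragmented orthologs.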