Chromosome-level genome assembly of the western flower thrips Frankliniella occidentalis

The western flower thrips Frankliniella occidentalis (Thysanoptera: Thripidae) is a global invasive species that causes increasing damage through direct feeding on crops and transmission of plant viruses. Here, we upgrade a previously published scaffold-level genome assembly to the chromosome level using Hi-C sequencing technology. The assembled genome has a size of 302.58 Mb, with a contig N50 of 1533 bp, a scaffold N50 of 19.071 Mb, and a BUSCO completeness of 97.8%. All contigs are anchored on 15 chromosomes. A total of 16,312 protein-coding genes are annotated in the genome, with a BUSCO completeness of 95.2%. The genome contains 492 non-coding RNAs and 0.41% interspersed repeats. This chromosome-level genome provides a high-quality resource for understanding the ecology, genetics, and evolution of thrips.


Background & Summary
Thrips are a group of tiny insects from the order Thysanoptera. Most thrips species feed on plants and fungi, while only a small number of species are predators of small invertebrates 1. There are over 7000 species of thrips, 150 of which are harmful to plants 2. Pest thrips are causing increasing damage to crop production worldwide 3,4. Thrips are easily dispersed through the transportation of host plants 5. The western flower thrips (WFT) Frankliniella occidentalis is one of the most notorious thrips worldwide 6,7. This species is native to America and has spread worldwide since the 1970s as an invasive species 8. The invasion genetics of this species have been widely investigated 9-12. Insecticides have been used frequently to control this pest, which has led to insecticide resistance in the field 13,14. However, the mechanisms of WFT resistance to many insecticides remain to be explored 14,15. In addition to feeding directly on plants, WFT can transmit plant viruses from the genus Tospovirus, making it an important species for understanding insect-plant-virus interactions 16,17. A genome assembly is crucial for understanding the complex biology, ecology and genetics of the WFT. The WFT genome was the first thrips genome to be assembled and made publicly available 18, providing an invaluable resource for studying the genetic mechanisms governing pest and vector biology, feeding behaviours, ecology, and resistance to insecticides, as well as for developing novel control methods 10,19,20. However, some genes are scattered across different scaffolds of the current assembly, which hinders functional genomics studies of this species. Improving the WFT genome assembly to the chromosome level will benefit future studies of this important insect pest. Here, we assembled the previously published scaffold-level genome of WFT 18 to the chromosome level using chromosome conformation capture (Hi-C) technology.

Methods
Sample collection and Hi-C library sequencing. The chromosome conformation of the genome was analysed using Hi-C technology to determine the order and orientation of the contigs. A strain of WFT was reared for approximately 10 generations and used for Hi-C library construction at the College of Forestry, Inner Mongolia Agricultural University, Hohhot, China. Approximately 1000 live adults of mixed sex were ground and then cross-linked in fresh, ice-cold nuclear isolation buffer with a 2% formaldehyde solution for 10 minutes. The fixed cells were then digested with the restriction enzyme DpnII (NEB) and further processed by cell lysis, incubation, DNA end labelling with biotin-14-dCTP, and blunt-end ligation of the cross-linked fragments. The Hi-C library was amplified by 12-14 PCR cycles and sequenced on the Illumina NovaSeq 6000 platform. A total of 36.11 Gb of clean data were generated, representing 119.34X coverage of the genome.
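The reported coverage can be checked as the ratio of sequenced bases to assembly size. The short sketch below uses only the two values given in the text (36.11 Gb of clean data and the 302.58 Mb assembly); the function name `sequencing_depth` is introduced here for illustration and is not part of the original pipeline.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return mean fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Values reported in the text (bases, not reads).
clean_bases = 36.11e9      # 36.11 Gb of clean Hi-C data
assembly_size = 302.58e6   # 302.58 Mb chromosome-level assembly

depth = sequencing_depth(clean_bases, assembly_size)
print(f"{depth:.2f}X")
```

Dividing by the assembly size rather than by a k-mer-based size estimate is what reproduces the 119.34X figure quoted above.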

Genome characteristics estimation.
Genome characteristics were estimated from Illumina short reads. Raw whole-genome sequencing reads of WFT were downloaded from the NCBI Sequence Read Archive database (accession number SRR1300140). The raw sequences were trimmed using fastp 21 with default parameters. The trimmed data were used to compute K-mer distribution histograms for K = 17, 21, 27, 31 and 41 using KMC v3.0 22 with parameters '-m96 -ci1 -cs10000' and '-cx10000'. The genome size, heterozygosity rate, and duplication rate were estimated using GCE v2.0 23 with default parameters. The estimated genome size and genome duplication decreased as the K-mer size increased, ranging from 281 Mb to 287 Mb and from 1.75% to 2.65%, respectively. Each K-mer distribution showed a single peak, indicating that the WFT genome is relatively simple (Table 1, Fig. 1).
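As a rough illustration of how a K-mer histogram relates to genome size, the sketch below applies the simplest peak-based approximation: total K-mer occurrences divided by the depth at the main peak. This is not the model implemented in GCE, which additionally accounts for heterozygosity and repeats; the function name, the `min_depth` error cutoff, and the toy histogram are all assumptions made for this example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Peak-based genome size estimate from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption of this sketch, not a GCE parameter).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")
    # Depth with the most distinct k-mers: the main (homozygous) peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences contributed by the usable bins.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Made-up toy histogram: an error spike at low depth and one peak at depth 20.
toy_histogram = {1: 1_000_000, 2: 200_000, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

A single dominant peak, as observed for every K-mer size here, is what makes this kind of estimate straightforward; a second peak at half the depth would indicate substantial heterozygosity.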
Genome assembly and annotation. The scaffold-level genome was downloaded from the NCBI database (accession number: GCF_000697945) and used for chromosome-level genome assembly based on the Hi-C sequencing data. Low-quality reads and adapters in the Hi-C library were filtered using Trimmomatic v0.39 24 with default parameters, and the remaining reads were mapped to the genome contigs using Juicer 25 with default parameters. The contigs were then grouped into chromosomes using 3D de novo assembly (3D-DNA) 26 with parameters '-editor_repeat_coverage = 15, -r 2'. Misjoins were manually corrected in Juicebox v2.16.00 (https://github.com/aidenlab/Juicebox), and the raw chromosomes were updated by running the 3D-DNA script "run-asm-pipeline-post-review.sh".
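Hi-C scaffolding relies on the fact that loci close together on the same chromosome produce more read-pair contacts than distant or unlinked loci. The sketch below shows only the underlying counting step, binning read-pair positions into a contact matrix; it is not the Juicer or 3D-DNA implementation, and the function name, bin size, and read-pair coordinates are hypothetical.

```python
from collections import Counter


def contact_matrix(pairs, bin_size):
    """Count Hi-C contacts between fixed-size bins along one sequence.

    pairs is an iterable of (pos1, pos2) mapping positions in bp.
    Returns a Counter keyed by (bin_i, bin_j) with bin_i <= bin_j,
    i.e. the upper triangle of the symmetric contact matrix.
    """
    counts = Counter()
    for pos1, pos2 in pairs:
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(i, j)] += 1
    return counts


# Made-up read pairs on a 3 kb toy sequence, binned at 1 kb.
toy_pairs = [(100, 900), (1200, 1800), (150, 1900), (2500, 2600), (2100, 2900)]
counts = contact_matrix(toy_pairs, bin_size=1000)
for (i, j), n in sorted(counts.items()):
    print(f"bin {i} x bin {j}: {n} contact(s)")
```

In a matrix like this, most of the signal sits on or near the diagonal, which is the pattern inspected during manual review.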
The repeat-masked genome assembly was submitted to the online tool Helixer 27 for gene structure annotation under the invertebrate lineage-specific mode. Helixer is a tool for cross-species gene annotation of large eukaryotic genomes using deep learning algorithms. Functional annotation was performed by searching the predicted proteins against the EggNOG v5.0 28 database using eggNOG-Mapper 29. Additionally, the entire gene set was functionally annotated by aligning protein sequences against the Nr, UniProt_SwissProt, UniRef90, InterPro (-appl pfam, PRINTS, PANTHER, ProSiteProfiles, SMART, CDD, SFLD, AntiFam), KEGG and GO databases using BLASTP and InterProScan version 5.59-91.0 (https://github.com/ebi-pf-team/interproscan) with an e-value cutoff of 1e-5.
The final genome assembly consisted of 250,191 contigs, which were assembled into 15 chromosomes (Fig. 2). The chromosome sizes ranged from 15.116 Mb to 32.461 Mb, with a total length of 302.58 Mb, a contig N50 length of 1533 bp, and a scaffold N50 length of 19.071 Mb. We numbered the chromosomes in descending order of size. Compared to the scaffold-level assembly, which had a size of 415.8 Mb and a scaffold N50 of 948.9 kb, the genome size was reduced and was closer to the estimated genome size. In total, 16,312 protein-coding genes (PCGs) were identified, which is 547 fewer than the 16,859 genes in the official gene set (OGS v1.0) of the scaffold-level assembly. The number of functionally annotated terms differed among reference databases, ranging from 15,619 PCGs for the Nr database to 370 domains for the InterPro database (Table 2). The G + C content of the final genome assembly was 50.75% (Table 1), which is similar to that of the published WFT genome 18, lower than those of Frankliniella intonsa 30,31, Megalurothrips usitatus 32,33, Stenchaetothrips biformis 34 and Thrips palmi 35, and slightly higher than that of Frankliniella fusca 36.
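The contig and scaffold N50 values reported above follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation is given below; the function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length; the length of
    the sequence that crosses this threshold is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total length 30, half is 15; 10 + 8 = 18 crosses it, so N50 = 8.
print(n50([10, 8, 6, 4, 2]))
```

Applied to contig lengths and to scaffold (chromosome) lengths separately, the same function yields the two N50 statistics, which explains how a short contig N50 and a scaffold N50 of tens of megabases can coexist in one assembly.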

Data Records
The genome project was deposited at NCBI under BioProject No. PRJNA1016120. The Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI under accession SRR26106059 43. The genome assembly, genome structure annotation and protein files were deposited in Figshare under the DOI https://doi.org/10.6084/m9.figshare.24968679.v1 44. The final genome assembly was also deposited in GenBank at NCBI under the accession number GCA_035583395.1 45.

Technical Validation
Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.5 46 was used to assess the completeness and quality of the genome assembly and the annotated protein-coding genes based on the Eukaryota, Metazoa, Arthropoda and Insecta (odb10, released on 2024-01-08) datasets. For the chromosome-level genome assembly, the BUSCO completeness was 97.7%, 98.7%, 98.5% and 97.8% based on the Eukaryota, Metazoa, Arthropoda and Insecta datasets, respectively. For the protein-coding gene set, the BUSCO completeness was 93.3%, 95.6%, 95.7% and 95.2% based on the Eukaryota, Metazoa, Arthropoda and Insecta datasets, respectively.
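BUSCO completeness is the percentage of expected single-copy orthologs recovered as complete, counting both single-copy and duplicated hits. The sketch below shows how the percentages follow from the four category counts; the function name is hypothetical and the counts are made up for illustration, not taken from the BUSCO runs reported here.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the dataset size.

    Complete = single-copy + duplicated; the dataset size n is the sum
    of all four categories.
    """
    n = single + duplicated + fragmented + missing
    if n == 0:
        raise ValueError("at least one BUSCO must be counted")
    return {
        "complete": 100 * (single + duplicated) / n,
        "single": 100 * single / n,
        "duplicated": 100 * duplicated / n,
        "fragmented": 100 * fragmented / n,
        "missing": 100 * missing / n,
        "n": n,
    }


# Made-up counts for a hypothetical dataset of 1000 BUSCOs.
result = busco_percentages(single=850, duplicated=50, fragmented=40, missing=60)
print(result)
```

Because the denominator is the size of the lineage dataset, the same assembly yields different completeness values for the Eukaryota, Metazoa, Arthropoda and Insecta sets.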
To check that genetic differences between the samples did not affect the assembly, we mapped both the Illumina short reads used for the scaffold-level assembly and the Hi-C sequencing reads obtained in our study to the chromosome-level genome using BWA version 0.7.17-r1198-dirty 47. The mapping rates of the Illumina short reads and the Hi-C sequencing data were 94.70% and 95.15%, respectively.

Fig. 2 Genome-wide contact matrix of Frankliniella occidentalis generated using Hi-C data. Each blue square represents a chromosome, and each green square represents a contig. Fifteen chromosomes were anchored under the default parameters of the Juicer and 3D-DNA software. Numbers on the axes show the chromosome length in Mb. The numbers in bold at the bottom of the figure represent the chromosome numbers.

Table 1.
Statistics for the chromosome-level assembly and annotation of the Frankliniella occidentalis genome.

Table 2.
Summary of annotated protein-coding genes in the Frankliniella occidentalis genome. Percentage, percentage of each item among all genes.

Table 3.
Repeated elements identified in the Frankliniella occidentalis genome.

Table 4.
Non-coding RNA identified in the Frankliniella occidentalis genome.