Chromosome-level genome assembly of the predatory stink bug Arma custos

The stink bug Arma custos (Hemiptera: Pentatomidae) is a predatory enemy successfully used for biocontrol of lepidopteran and coleopteran pests in notorious invasive species. In this study, a high-quality chromosome-scale genome assembly of A. custos was achieved through a combination of Illumina sequencing, PacBio HiFi sequencing, and Hi-C scaffolding techniques. The final assembled genome was 969.02 Mb in size, with 935.94 Mb anchored to seven chromosomes, and a scaffold N50 length of 135.75 Mb. This genome comprised 52.78% repetitive elements. The detected complete BUSCO score was 99.34%, indicating its completeness. A total of 13,708 protein-coding genes were predicted in the genome, and 13219 of them were annotated. This genome provides an invaluable resource for further research on various aspects of predatory bugs, such as biology, genetics, and functional genomics.


Methods
Sample collection and rearing.The population of A. custos used in this study originated from a colony collected in the suburb of Kunming, Yunnan Province, China.These bugs have been maintained in our laboratory for more than 20 generations.They were fed with larvae of the yellow mealworm Tenebrio molitor, the greater wax moth Galleria mellonella, and the fall armyworm S. frugiperda.Cages measuring 40 cm × 40 cm × 40 cm, constructed with Nylon netting (44 × 32 mesh) on all sides, were used to rear the bugs.Each cage housed approximately 100 bugs.Soybean plants were also provided in the cage for feeding and perching.The bugs were reared at a constant temperature of 25 ± 1 °C, 70 ± 5% relative humidity, and a photoperiod of 14 L:10D.
Sequencing.Genomic DNA was extracted from one newly emerged male adult using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany).Total RNA was isolated from various adult tissues including different glands of the salivary venom apparatus (anterior main gland, posterior main gland, and accessary gland), gut and  For short-read genomic and transcriptome sequencing, the library with an insert size of 350 bp was constructed using the NEBNext Ultra DNA Library Prep Kit (Illumina, San Diego, CA, USA) following manufacturer's recommendations.This library was then sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA).The genomic short-read data yielded from the Illumina NovaSeq 6000 platform amounted to 74.86 Gb with a Q20 value of 96.56% and a Q30 value of 90.84% (Table 1).A total of 72.86 Gb transcriptomic data were generated, which have Q20 values over 96.56% and Q30 values more than 90.84%.
For PacBio HiFi long-read sequencing, the SMRTbell library was prepared with the SMRTbell Express template preparation kit 2.0 (Pacific Biosciences, Menlo Park, CA) and subsequently sequenced using the Sequel II Sequencing Kit 2.0 with SMRT Cell 8 M Tray on a PacBio sequel II instrument (Pacific Biosciences, Menlo Park, CA).In total, 34.41 Gb high-quality HiFi reads (34.85 × coverage) were obtained with an average length of 14.96 kb and an N50 length of 15.18 kb (Table 1).
The Hi-C library was generated using the restriction endonuclease Mbol following the standard protocol described previously 20 , which was sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA) using a 150-bp paired-end strategy.A total of 163.62 Gb (165.72 × coverage) of raw data was generated.
Genome survey.To ensure data quality, adapter sequences and low-quality reads were removed with fastp v0.21.0 21 .The resulting clean reads were used to generate a histogram of the 17-mer distribution with Jellyfish v2.2.7 with parameters 'count -g generators -G 4 -s 5 G -m 17 -C -t 10' 22 (Fig. 1), followed by calculation of   genome heterozygosity.Based on these analyses, the estimated genome size was determined to be 987.35 Mb, with a heterozygosity of 0.80%.

Genome assembly.
The PacBio HiFi reads were utilized to assemble the genome into contigs using hifiasm v0.16.1 23 .The assembled draft genome was polished by employing the genomic short-reads generated by Illumina NovaSeq 6000 sequencer with the NextPolish v1.4.0 24 .To identify and remove potential contaminant sequences, Kraken2 was employed against a custom database 25 .A total of 137 contigs were identified as bacteria and subsequently eliminated.The resulting draft genome was 969.02 Mb with a contig N50 of 2.11 Mb, and the GC content of 33.18% (Table 2).
Hi-C scaffolding.The raw HiC data were processed using Hi-C-Pro v2.8.0 26 , followed by quality control with fastp v0.21.0 21 .The resulting data were aligned to the draft genome assembly utilizing bowtie 2 v2.2.3 27 to obtain the uniquely mapped paired-end reads.Among the 8,661,026 reads, 4,330,513 reads were paired, with a total paired ratio of 38.70%.And a total of 1,470,719 reads were uniquely mapped to the genome, with an effect rate of 33.96%, representing valid interaction pairs.These valid interaction pairs were used to anchor the assembled contigs to near-chromosomal level using the Allhic v0.9.8 28 .Then, juicebox v1.11.08 29 was employed for manual correction based on chromosome interaction strength, ultimately resulting a chromosome-level genome.After curation, a total of 935.94 Mb of contigs, accounting for 96.58% of the assembled draft genome, were anchored into seven chromosomes, ranging from 77.33 Mb to 234.11 Mb (Table 3).The number of anchored chromosomes matched the result of chromosome karyotype analysis following the previously reported method 30 (Fig. 2).The final genome exhibited an N50 of 135.75 Mb.A genome-wide chromatin interaction HiC heatmap was constructed using the ggplot2 software in the R package.According to the heatmap, all chromosomes were clearly distinguishable from each other (Fig. 3).The Advanced Circos tool implanted in TBtools v1.098765 31 was used to visualize the landscape of the chromosomes (Fig. 4).

Genome annotation.
A combined strategy of homology alignment and de novo search was applied to identify repetitive elements in the genome.Tandem repeats were detected using Tandem Repeats Finder (TRF) v4.09 32 .
Repetitive elements homologous to those available in the Repbase28.06 33were identified with RepeatMasker v4.1.0and RepeatProteinMask v4.1.0 34.In addition, a de novo repetitive elements database was generated using LTR_FINDER v1.0.6 35 , RepeatScout v1.0.5 36 , and RepeatModeler v2.0.1 37 .The resulting repeat sequences with lengths greater than 100 bp and gap 'N' less than 5%, obtained from both two strategies, were combined to construct the raw transposable element library.This library was then processed by UCLUST algorithm 38 to yield a non-redundant library, followed by DNA-level repeat identification using RepeatMasker v2.0.1 37 .The results indicated that the genome contained 52.78% repetitive elements, most of which were long terminal repeat (LTR) retrotransposons, representing 40.05% of the genome (Table 4).For non-coding RNA (ncRNAs) annotation, the transfer RNAs (tRNAs) were predicted using tRNAscan-SE v1.4 39 .As ribosomal RNAs (rRNAs) are highly conserved, they were predicted by searching against selected rRNA sequences from closely related species as references using the BLAST v2.2.26 40 .Other ncRNAs, including micro RNAs (miRNAs) and small nuclear RNAs (snRNAs), were identified by searching against the Rfam database v14.1 41 using the Infernal v1.1.2 42.Overall, 20,337 tRNAs, 1,556 rRNAs, 2,790 miRNAs and 596 snRNAs were predicted, resulting in a total of 25,279 ncRNAs (Table 5).A combined three-pronged strategy, involving de novo prediction, homology-based gene prediction, and transcriptome-based prediction, was employed to annotate the genes in the genome.De novo gene models were generated by multiple programs, namely Augustus v3.3.3 43 , GlimmerHMM v3.0.4 44 , SNAP v2013.11.29 45 , Geneid v1.4 46 , and Genscan v1.0 47 .For homology-based prediction, protein sets from five bugs including Halyomorpha halys 48 , Nesidiocoris tenui 49 , Oncopeltus fasciatus 50 , Rhodnius prolixus 51 and Apolygus lucorum 52 were downloaded from Insectbase 2.0 53 on January 3, 2022.These protein sets were aligned to the assembled genome using TblastN v2.2.26 40 with an E-value threshold of ≤1e −5 .The matching proteins from these bugs were used to predict the gene structure of the assembled genome with GeneWise v2.4.1 54 .For transcriptome-based prediction, raw reads from five transcriptomic libraries were subjected to quality control with fastp v0.21.0 21 .After eliminating adapter sequences and low-quality reads with Trimmomatic v1.4 55 , clean data were assembled into transcripts using Trinity v2.11.0 56 and StringTie2 v2.1.6 57.The candidate coding regions in these transcripts were predicted using TransDecoder v5.5.0 56 , which is implemented in the Trinity software.The resulting protein sequences were used to predict the gene structures following the procedure as described for homology-based prediction.In addition, the clean transcriptomic data were aligned to the assembled genome using HISAT2 v2.2.1 58 to identify the exons and splice sites, and these were used to extract the gene structures using PASA v2.4.1 59 .A non-redundant reference gene set was generated by merging genes predicted by the three strategies with EVidenceModeler (EVM) v1.1.1 60 .The gene models were further updated with PASA v2.4.1 59 to identify untranslated regions.Finally, the final comprehensive gene set was generated, resulting in a total of 13,708 protein-coding genes (Table 6).These genes had an average gene length of 25,698.42bp.The average lengths of their coding sequence (CDS), exon, and intron length were 1,400.68bp, 192.17 bp, and 3,863.69bp, respectively.On average, each gene contained 7.29 exons (Table 7).

Class
The annotation of the protein-coding genes was performed using BLAST v2.2.26 40 against SwissProt and National Center for Biotechnology Information (NCBI) non-redundant (Nr) database with DIAMOND v2.2.22 61 , parameters used '-ultra-sensitive -max-target-seqs. 1 -evalue 1e −5 ' with a threshold of E-value ≤ 1e −5 .The motifs and domains present in the predicted proteins encoding by these genes were annotated using InterProScan v86.0 with parameters '-disable-precalc, -goterms, -pathways' and Pfam 62 .Additionally, these genes were classified into functional categories based on KEGG 63 and GO 64 with a threshold of E-value ≤ 1e −5 .Overall, 13,219 predicted genes were annotated using the databases of Nr, SwissProt, InterProScan, Pfam, KEGG and GO, representing 96.43% of the total gene set (Table 8).

Data records
The raw data of Illumina short reads, PacBio HiFi long reads and Hi-C reads for assembling the genome of A. custos, as well as the transcriptome Illumina sequencing data for genomic annotation, have been deposited in the NCBI SRA (Sequence Read Archive) database under BioProject number PRJNA1001510.Illumina sequencing data for genome survey can be accessed and downloaded with accession number SRR25498178 65 .PacBio sequel II sequencing data for genome assembly can be accessed and downloaded with accession number SRR25503034 66 .Hi-C sequencing data can be accessed and downloaded with accession number SRR25518321 67 .Transcriptome sequencing data for genome annotation can be accessed and downloaded from NCBI SRA database (https://identifiers.org/ncbi/insdc.sra:SRP453032) 68.The genome sequence has been deposited in Genbank under the accession number JBBAGI000000000 (https://www.ncbi.nlm.nih.gov/nuccore/JBBAGI000000000) 69.The final chromosome assembly, genome structure annotation, amino acid sequences and CDS sequences data are available at the Figshare database (https://doi.org/10.6084/m9.figshare.25284943) 70.

technical Validation
The accuracy of the assembled genome was assessed using two methods.Firstly, the clean Illumina genomic short reads were aligned back to the genome by Burrows-Wheeler Aligner (BWA) v.0.7.12-r1039 71 .Approximately 97.81% of the short reads were successfully aligned to the genome, providing a genome coverage of 99.95%.The heterozygous and homozygous nucleotide polymorphisms (SNPs) in the genome were 0.407191% and 0.00011%, respectively.The results indicate a high accuracy of the genome assembly.Secondly, the accuracy of the assembled genome was evaluated using Merqury v1.4 72 .A quality value of 46.78 was obtained, affirming the base-level accuracy genome assembly.The completeness of the assembled genome was evaluated using three methods.Firstly, Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.7 (-l insecta_odb10 -m genome) 73 was employed.The results showed that the complete and fragment scores were 99.34% and 0.22%, respectively.Among the retrieved complete single-copy genes, only 2.3% of them are duplicated.Secondly, Core Eukaryotic Genes Mapping Approach (CEGMA, v2.5) 74 was employed.Among the 248 most highly conserved core eukaryotic genes (CEGs) within CEGMA, 230 CEGs were successfully assembled, accounting for 92.74%, and 222 CEGs were complete, accounting for 89.52%.Thirdly, LTR Assembly Index (LAI) was assessed using LTR_retriever v. 2.9.0 75 , resulting in a value of 8.44.These results indicated a high level of completeness in the genome assembly.

Fig. 1
Fig.1The 17-mer analysis of the genome of Arma custos.The X-axis represents the k-mer depth.The Y-axis indicates the k-mer frequency for a given depth.

Fig. 2
Fig. 2 Karyotype analysis of Arma custos reveals a chromosome count of seven.The chromosomes from two nuclei are shown.

Fig. 3
Fig. 3 Heatmap of the Hi-C assembly of Arma custos.The interaction intensity of Hi-C links represents by colors shown in the left bar, ranging from yellow (low) to red (high).

Fig. 4
Fig. 4 Overview of the genome characteristics of Arma custos in a circos plot.(a), length of chromosomes at the Mb scale; (b), gene density per Mb; (c), CG content per Mb.

Table 1 .
Statistics of sequencing data for genome assembly and annotation.

Table 2 .
Statistics of the Arma custos genome assembly.

Table 3 .
Summary of the assembled seven chromosomes of Arma custos.

Table 4 .
Summary of repetitive sequences identified in the genome of Arma custos.

Table 5 .
Summary of non-coding RNAs predicted in the genome of Arma custos.

Table 8 .
Summary of functional annotation of protein-coding genes encoded in genome of Arma custos.