The first chromosome-level genome assembly of Entomobrya proxima Folsom, 1924 (Collembola: Entomobryidae)

The Entomobryoidea, the largest superfamily of Collembola, encompasses over 2,000 species in the world. However, the lack of high-quality genomes hinders our understanding of the evolution and ecology of this group. This study presents a chromosome-level genome of Entomobrya proxima by combining PacBio long reads, Illumina short reads, and Hi-C data. The genome has a size of 362.37 Mb, with a scaffold N50 size of 57.67 Mb, and 97.12% (351.95 Mb) of the assembly is located on six chromosomes. The BUSCO analysis of our assembly indicates a completeness of 96.1% (n = 1,013), including 946 (93.4%) single-copy BUSCOs and 27 (2.7%) duplicated BUSCOs. We identified that the genome contains 22.16% (80.06 Mb) repeat elements and 20,988 predicted protein-coding genes. Gene family evolution analysis of E. proxima identified 177 gene families that underwent significant expansions, which were primarily associated with detoxification and metabolism. Moreover, our inter-genomic synteny analysis showed strong chromosomal synteny between E. proxima and Sinella curviseta. Our study provides valuable genomic information for comprehending the evolution and ecology of Collembola.

reads, and Hi-C data.We annotated repeats, non-coding RNAs (ncRNAs), and protein-coding genes (PCGs), and conducted gene family evolution analysis.Moreover, we explored the interspecific chromosomal variation between the two Entomobryidae species, E. proxima and Sinella curviseta.The high-quality genome of E. proxima is an important milestone in our understanding of Entomobryoidea and will contribute to the study of collembolan evolution and ecology.

Methods
Sample collection and sequencing.The E. proxima samples used in this study were collected in June 2020 from the Purple Mountain in Nanjing, China (32.06 °N, 118.83 °E).Adult individuals were collected by an entomological aspirator, washed in phosphate-buffered saline, and instantly frozen using liquid nitrogen.A total of 100, 30, 50, and 50 individuals were used for PacBio, genome survey, Hi-C, and transcriptome sequencing, respectively.To obtain the requisite amount of DNA/RNA for sequencing, we utilized a pooled extraction approach, in which multiple individuals were combined prior to nucleic acid extraction.
Genomic DNA and RNA from pooled specimens were extracted using the DNeasy Blood & Tissue Kit and TRIzoTM Reagent, following the manufacturer's instructions.PCR-free short-read libraries of 150 bp paired-end read with a 350 bp insert size were generated using the Truseq DNA PCR-free Kit.The Hi-C sequencing was carried out by digesting extracted DNA with the Mbol restriction enzyme.We utilized the Illumina NovaSeq 6000 platform to sequence all short-read libraries.A library with a 30 kb-insert size was prepared using the SMRTbell Template Prep Kit 1.0-SPv3 for PacBio sequencing.To generate the library, we used the PacBio Sequel II platform, which employs the PacBio CLR mode.Berry Genomics (Beijing, China) carried out all library construction and sequencing.Finally, we obtained 292.18 Gb of sequencing data, comprising 206.73 Gb (570.50×) of PacBio reads, 45.89 Gb (126.62×) of Illumina reads, 29.83 Gb (82.32×) of Hi-C data, and 9.73 Gb of transcriptome data (Table 1).The raw PacBio long reads had a scaffold N50 and an average length of 29.41 and 24.08 kb, respectively.

Genome assembly.
We performed quality control on raw Illumina data using BBTools v38.82 10 , which included the "clumpify.sh"script to remove duplicate reads.We trimmed low-quality reads using the "bbduk.sh"script, which consisted of base quality trimming of both ends (>Q20), length filtering (>15 bp), polymer A/G/C trimming (>10 bp), and correction of overlapping paired reads.
The primary assembly of PacBio long reads was generated using Flye v2.8.3 11 , with a minimum overlap of 3,000 between reads.We then performed one round of self-polishing on the assembly.NextPolish v1.1.0 12was used to polish the primary assembly with Illumina reads in two rounds.Redundant contigs were eliminated using Purge_Dups v1.2.5 13 with a haploid cutoff set at 50 for identifying contigs as haplotigs ("-s 50").We used Minimap2 v2.17 14 for read mapping during the above "remove haplotigs" and "short-read polishing" steps.We aligned Hi-C reads to the assembly after performing quality control using Juicer v1.6.2 15 .Subsequently, we anchored the contigs onto chromosomes using 3D-DNA v180922 16 .To ensure accuracy, we manually reviewed and corrected any errors using Juicebox v1.11.08 16 .We detected possible contaminants using MMseqs. 2 v11 17 , which performed BLASTN-like searches based on the NCBI nucleotide and UniVec databases with a sequence identity of 0.8 ('-min-seq-id 0.8').Additionally, we utilized blastn (BLAST + v2.11.0 18 ) against the UniVec database to specifically detect vector contaminants.We considered that sequences with over 90% hits in the aforementioned database likely contained contaminants.Sequences with over 80% hits were checked again via online BLASTN analysis in the NCBI nucleotide database.We then removed potential bacterial and human contamination from the assembled scaffolds.Most of the identified contaminants were bacterial, including Rickettsia and Sphingobacterium.The final chromosome-level genome assembly of E. proxima had a size of 362.37 Mb, comprising 461 scaffolds and 2,007 contigs, with the scaffold and contig N50 sizes of 57.67 Mb and 0.44 Mb, respectively.Among them, 1,542 contigs (97.12%, 351.95 Mb) were anchored into six chromosomes with lengths ranging from 48.13 to 69.80 Mb and the GC content was 35.93% (Table 2; Fig. 1).The genome size and GC content of E. proxima (362.37 Mb and 35.93%) were smaller than that of S. curviseta (363.44 Mb and 37.52%) (Table 3).

Phylogeny and gene family evolution.
and Holacanthella duospinosa) (Table 4).We used OrthoFinder v2.5.2 41 to infer orthogroups (gene families), after eliminating redundant isoforms.For sequence alignment, we used the Diamond with ultra-sensitive mode ("-S diamond_ultra_sens").We assigned 143,727 (89.6%) genes to 21,690 orthogroups, of which 3,924 were shared by all eight species and 2,252 were single-copy genes (Table S4).In E. proxima, 19,453 genes (92.7%) were contained in 12,705 gene families, of which 389 families and 1,886 genes were specific to this species.We aligned single-copy orthologues using MAFFT v7.450 42 with the high-accuracy LINS-I strategy.We then removed alignment gaps using trimAl v1.4.1 43 with the "automated1" parameter.Finally, we reconstructed the phylogenetic tree on the single-copy orthologs using IQ-TREE v2.07 44 with the LG site-homogeneous model.We employed MCMCTree in PAML v4.9j 45 to predict species divergence time and nodes were calibrated using two fossils obtained from the PBDB database (https://www.paleobiodb.org/navigator/):the most common ancestor of Diplura and Collembola (407.6-410.8Ma) and Entomobryomorpha (272.3-279.3Ma).After filtering out 244 single-copy genes, we performed phylogenetic reconstruction using IQ-TREE based on 2,008 remaining genes.The analysis showed that E. proxima is closely related to S. curviseta, and the divergence time of these species was approximately 47.01-51.94Mya (Fig. 2c).Our phylogenetic results were consistent with previous studies, supporting Orchesellidae (O.cincta) as a sister group to Entomobryidae (E.proxima, and S. curviseta) 7,46,47 (Fig. 2c).
We used CAFÉ v4.2.1 48 to estimate the gene family evolution (expansion or contraction) based on the generated phylogenetic tree.We identified 1,351 expanded and 1,547 contracted gene families in E. proxima, including 223 gene families that underwent rapid evolution (177 expansions and 46 contractions).The significantly expanded families included the Leucine-rich repeat domain superfamily, zinc finger, Cytochrome P450, and other families that play important roles in the adaptive evolution of E. proxima (Table S5; Fig. 2b).Subsequently, we performed functional enrichment (GO and KEGG) analysis on PCGs from significantly expanded families using ClusterProfiler v3.10.1 49 with default parameters.The enrichment of GO and KEGG in rapidly expanding families further indicates their function in the regulation of hormone levels, oxidoreductase activity, metabolic, The values labeled at terminals denote the number of significantly expanded and contracted gene families."1:1:1" represents universal single-copy genes in all species, "N:N:N" represents multi-copy genes, "others" represents unclassified orthologues, and "unassigned" represents orthologues that cannot be assigned to any orthogroups.and catabolic processes, among others (Fig. 3a,b).We uploaded the complete enrichment CSV tables to Figshare (https://doi.org/10.6084/m9.figshare.23861901).
Chromosome synteny.To investigate interspecific chromosomal evolution between E. proxima and S. curviseta, we used MMseqs. 2 with an e-value of 1e-5.Syntenic blocks were identified using MCScanX 50 and the repeat content, gene density, and GC content on individual pseudochromosomes using TBtools 51 .Comparing the genomes of E. proxima and S. curviseta, we identified 341 syntenic blocks that contained 14,473 collinear genes (45.04%).The average number of genes per block was 42, while a notable 12.02% (41 blocks) contained over 100 collinear genes.We have uploaded the syntenic blocks results containing each block's details to Figshare (https:// doi.org/10.6084/m9.figshare.23861901).Notably, our analysis revealed that the ChrX of S. curviseta shares strong chromosomal syntenic relationships with Chr6 of E. proxima (Fig. 3c).Other chromosomes (1-5) of E. proxima matched with their corresponding chromosomes in S. curviseta, although we did observe some regions lacking homology.These differences may be attributed to the significant divergence time (47.01-51.94Mya) between the two species.

Data records
The raw sequencing data and genome assembly of Entomobrya proxima have been deposited at the National Center for Biotechnology Information (NCBI).The PacBio, Illumina, Hi-C, and transcriptome data can be found under identification numbers SRR15910088-SRR15910092 [52][53][54][55][56] .The assembled genome has been deposited in the NCBI assembly with the accession number GCA_029691765.1 57 .Additionally, the results of annotation for repeated sequences, gene structure, and functional prediction have been deposited in the Figshare database 58 .

Technical Validation
Two methods were used to evaluate the quality of the genome assembly.Firstly, we assessed assembly completeness using BUSCO v3.0.2 59 with the reference arthropod gene set (n = 1,013).The final genome assembly showed a BUSCO completeness of 96.1%, consisting of 946 (93.4%) single-copy BUSCOs, 27 (2.7%)duplicated BUSCOs, 7 (0.7%) fragmented BUSCOs, and 33 (3.2%) missing BUSCOs.Secondly, we calculated the mapping rate as a measure of assembly accuracy.The mapping rates for PacBio, Illumina, and RNA reads were 99.93%, 88.83%, and 86.72%, respectively.These evaluations collectively reflect the high quality of the genome assembly produced in this study.

Fig. 1
Fig.1Genome-wide chromosomal heatmap of Entomobrya proxima, with each chromosome and contig framed in blue and green, respectively.

Fig. 2
Fig. 2 Genome characteristics, phylogeny, and gene family evolution of Entomobrya proxima.(a) From the outer ring to the inner ring are the distributions of chromosome length, GC content, gene density, TEs (DNA, SINE, LINE, and LTR), and simple repeats.(b) The top 20 gene families that significantly expanded in E. proxima.(c) The phylogeny and gene family changes among eight arthropod species, with node values representing the 95% highest probability densities of divergence times (unit, 100 Ma).The values labeled at terminals denote the number of significantly expanded and contracted gene families."1:1:1" represents universal single-copy genes in all species, "N:N:N" represents multi-copy genes, "others" represents unclassified orthologues, and "unassigned" represents orthologues that cannot be assigned to any orthogroups.

Table 1 .
Statistics of the sequencing data used for genome assembly.