Abstract
Spiders are representative arthropods of adaptive radiation. The high-quality genomes have only been reported in several web weaver spider species, leaving the wandering spiders’ genomic information scarce. The pond wolf spider, Pardosa pseudoannulata, is a representative species in the retrolateral titial apophysis (RTA) clade. We present a chromosome-level P. pseusoannulata genome assembly of 2.42 Gb in size with a scaffold N50 of 169.99 Mb. Hi-C scaffolding assigns 94.83% of the bases to 15 pseudo-chromosomes. The repeats account for 52.79% of the assembly. The assembly includes 96.2% of the complete arthropod universal single-copy orthologs. Gene annotation predicted 24,530 protein-coding genes with a BUSCO score of 95.8% complete. We identified duplicate clusters of Hox genes and an expanded cuticle protein gene family with 243 genes. The expression patterns of CPR genes change in response to environmental stresses such as coldness and insecticide exposure. The high-quality P. pseudoannulata genome provides valuable information for functional and comparative studies in spiders.
Similar content being viewed by others
Background & Summary
Spiders are considered one of the most diverse terrestrial predators with more than 50,000 species described1. They conquer most area of the planet and are well known for their sophisticated use of silk and venom2. Moreover, spiders are important model for evolution and adaptive radiation given the varied environmental conditions of their habitats3. Taking advantage of the ever expanding genomic information, researchers are able to untangle different aspects of spider biology, with particular interest on venom, silk, and the phylogeny4,5,6,7,8,9,10,11,12,13,14,15,16,17. The increasing numbers of genomes will facilitate the phylogenomic study to unravel the roots of spiders and comparative studies to decipher the morphological and behavioural traits of spiders. However, the high-quality genomes are far from adequacy for systemic studies when the species diversity is concerned. Different from various clades of web-weaver spiders, a number of wandering spiders constitute the retrolateral tibial apophysis (RTA) clade barely building foraging webs. This spider lineage is remarkably diverse with over 25,000 species, including the popular families Salticidae (jumping spiders) and Lyosidae (wolf spiders), and thus plays an important role in spider evolution. However, no high-quality genome has been reported in the RTA clade. We here report the pond wolf spider Pardosa pseudoannulata (Fig. 2b) genome assembly at the chromosome-level to fill the gap. P. pseudoannulata predates on a broad range of insects including planthoppers and leafhoppers18 and represent an important natural enemy in agricultural ecosystem, where they encounter various environmental stresses, such as coldness, drought, and insecticides. Our previous studies have revealed that the cuticle proteins are involved in spiders’ adaptation to stresses19. Taking advantage of the genome assembly, we conducted a systemic analysis of the expanded gene family encoding the cuticle proteins with R&R Consensus (CPR) and evaluated their involvement in spiders’ adaptation to coldness during overwintering and to insecticide nitenpyram exposure.
Through k-mer analysis, the estimated genome was 1.93 Gb and highly heterozygous (Fig. 1, Table 1). Genome sequencing yielded 1032.45 Gb of clean data from PacBio, Illumina, Hi-C, and Iso-seq data (Table 2). The final genome assembly was 2.42 Gb with 94.83% of the sequences assembled into 15 chromosomes (Fig. 2a,b, Table 4). The genome completeness assessed via BUSCO was 96.2%. The size of repeats was 1.28 Gb, accounting for 52.79% of the genome (Table 5). We predicted 24,530 protein-coding genes with the BUSCO completeness 95.8% (Tables 3, 7). The number of GO items and KEGG items was 13,925 and 19,197, respectively. Comparative genomic analysis revealed 265 gene families of significant expansion and 33 of significant contraction (Fig. 3b). Fifteen most significantly expanded gene families with the highest number of genes involved the gene regulation and protein processing including the cuticle protein gene family (Fig. 4, Table 10). Two clusters of Hox genes, one complete and one incomplete, were found in the genome. Cluster A was complete and occurred in a colinear order on chromosome 7 while Cluster B occurred on chromosome 15 with ftz and Hox3 absent (Fig. 2c). We identified totally 243 genes encoding cuticle proteins containing the Rebers and Riddiford (RR) consensus, 24 categorized as RR-1 type and 219 as RR-2 type (Fig. 5). Except for four genes in scaffolds, CPR genes were distributed in all 15 chromosomes where many formed tandem arrays or gene clusters, the majority of which exhibited high sequence similarity. Using the transcriptomic data from overwintering spiders and nitenpyram-treated spiders, we investigated the transcriptional responses of CPR genes to the two stresses. The majority of CPR genes were expressed at higher level in Sept. and Oct., and along with the drop in temperature, a large number of CPR genes were down regulated and significantly to ground level in Nov. Dec. and Jan. (Fig. 6a). However, one gene Papse09523 was upregulated in the course of overwintering though at a medium expression level among all CPRs. In the nitenpyram-exposure spiders, ten CPR genes were noticeable for their different transcriptional levels from those in control spiders (Fig. 6c). Three CPR genes were significantly up-regulated and they all encoded CPRs belonging to RR-1 subfamily. Seven CPR genes were noticeably down-regulated, six of which were located in a gene cluster of 12 genes in chromosome 10.
In summary, we assembled and annotated a chromosome-level genome of P. pseudoannulata here. The expanded cuticle protein gene family is highly related to P. pseudoannulata’s adaptation to varied environmental stresses as a wandering species. The duplicated clusters of Hox genes support the whole genome duplication event during the evolution of spiders. The genome data provide a reliable resource for functional and comparative studies in Arachnida.
Methods
Sample collection and sequencing
Five adult Pardosa pseudoannulata females (NCBI: txid330961) were collected in a paddy field in Jurong City, Jiangsu Province, China in 2020. Legs and muscle attached with cuticle from prosoma were collected and stored at −80°C. Genomic DNA was extracted with QIAGEN DNeasy Blood & Tissue kit. Two short paired-end insert libraries of 350-bp sequences were constructed for survey analysis using a Truseq DNA PCR-free kit and sequenced using the Illumina NovaSeq. 6000 platform. PacBio sequencing was performed with libraries constructed with insert size of 40 kb using a SMRTbell™ Template Prep Kit 2.0 on PacBio Sequel II. In addition, Hi-C sequencing was carried out with muscle from a single female. Muscle tissues were fixed with formaldehyde and lysed. The cross-linked DNA was digested with restriction enzyme MboI, biotinylated and proximity-ligated to form chimeric junctions that were physically sheared into size of 350 bp. Chimeric fragments representing the original cross-linked long-distance physical interactions were then processed into paired-end sequencing libraries, and 150-bp paired-end reads were generated using the Illumina NovaSeq. 6000.
RNA-seq samples were prepared in several individual batches of spiders. First, the Iso-seq sample was prepared by pooling the total RNA from spiders of different developmental stages (egg, juvenile, male and female adults) at equal molar ratio. Second, overwintering samples included spiderings (cephalothoax length 0.30–0.35 cm) collected monthly from paddy fields in Jurong city, Jiangsu Province, China from September 2019 to January 2020 when the average temperature went down from 26.5 to 3.0 degrees Celsius. Three replicate samples were prepared from each month with five individuals per replicated sample. Third, nitenpyram-exposure samples included spiderlings (14–20 days post egg-sac) fed with 1% acetone (solvent control) or 27.9 mg/L nitenpyram (treatment, sublethal dose) for 14 days. Three replicate samples were prepared for each control or treatment. Total RNA was extracted with TRIzol™ Reagent. RNA libraries were constructed with TruSeq RNA v2 kit and sequenced using the PacBio Sequel II platform. PacBio sequencing and Hi-C sequencing, and RNA sequencing, were performed by Berry Genomics (Beijing, China) and BGI Genomics (Shenzhen China).
We obtained 456.40 Gb PacBio DNA reads, 299.95 Gb Illumina DNA reads, 273.23 Gb Hi-C, and 2.87 Gb Iso-Seq data for genome assembly (Table 2).
Genome survey and assembly
The Illumina reads were quality controlled using BBTools suite v38.6720. Duplicates were removed with “clumpify.sh”. Low-quality reads were discarded with “bbduk.sh” (qtrim = rl trimq = 20 minlen = 15 ecco = t maxns = 5 trimpolya = 10 trimpolyg = 10 trimpolyc = 10). The filtered reads were applied to estimating genome size. The k-mer distribution was estimated with khist.sh (BBTools). A 21-mer was selected for k-mer anlaysis with the maximum k-mer coverage cutoff set to 10,000 using Genomescope v2.021. The estimated genome was 1.93 Gb and highly heterozygous (Fig. 1). The size of repeat sequence was estimated to be 619.26 Mb, accounting for 32.10% of the estimated genome (Table 1).
PacBio sequencing reads were first corrected with NextDenovo v2.3.1 (https://github.com/Nextomics/NextDenovo) and then subjected to assembly in Raven v1.3.0 (–weaken -p 0)22. Two rounds of polishing were carried out with Illumina short reads using NextPolish v1.3.023 to improve the accuracy of the assembly. The heterozygous regions of the assembly were removed with two rounds of Purge_dups v1.2.5 (-a 70)24.
The chromosome-level assembly of the genome was generated by incorporating the Hi-C sequences. The Hi-C reads were firstly quality-controlled to remove duplicates and then aligned to the assembly using Jucier v1.6.225. Chromosome-level pseudo-chromosomes were assembled with the defaults using 3D-DNA v18092226. Resulting assembly was manually corrected using Juicebox v1.11.0825 and completed using 3D-DNA. The final genome assembly was 2.42 Gb with 94.83% of the sequences assembled into 15 chromosomes (Fig. 2a,Tables 3, 4).
Genome annotation
A de novo repeat database was initially constructed by using RepeatModeler v2.0.2a27 with LTR searching activated (-LTRStruct). The de novo repeat database was then combined with RepBase-2018102628 to obtain the reference repeat database. The repetitive elements were predicted and masked by searching against the reference repeat database using RepeatMasker v4.1.029. The size of repeats was 1.28 Gb, accounting for 52.79% of the genome (Table 5). The most abundant repeat types were unknown (25.01%), DNA elements (13.52%), simple repeats (6.95%), LTR elements (2.37%), and LINEs (2.16%).
The non-coding RNAs (ncRNAs) were identified by predicting rRNA, snRNA and miRNA against Rfam database using infernal v1.1.430 and annotating tRNA using tRNAscan-SE v2.0.931 in the genome. tRNAs of low confidence were removed using the “EukHighConfidenceFilter” script in tRNAscan-SE. In total, 6,011 ncRNAs were predicted in the genome (Table 6).
Protein-coding genes were predicted via a strategy integrating three methods (ab initio, transcript-based, and protein homology-based evidence) with the repeat-masked genome in MAKER v3.01.0332. Ab initio prediction was carried out by Augustus v3.4.033 and GeneMark-ES/ET/EP 4.68_lic34, both gene finders trained in BRAKER v2.1.535 with the mapped RNA-seq and protein orthologs via ProtHint v2.4.0. RNA-seq data were firstly mapped to the genome assembly using HISAT2 v2.2.136. Using the genome assembly as a reference, the mapped RNA-seq data were assembled into transcripts via StringTie v2.1.637. Lastly, the reference protein sequences from Stegodyphus dumicola (GCA_010614865.1), Parasteatoda tepidariorum (GCA_000365465.3), Argiope bruennichi (GCA_015342795.1), Daphnia magna (GCA_020631705.2), Ixodes scapularis (GCA_016920785.2), Trichonephila antipodiana (PRJNA627506) were downloaded from NCBI database and served as the protein homology-based evidence for MAKER. The structural annotation predicted 24,530 protein-coding genes. The average number of exon per gene is 6.1, with a mean exon length of 262.7 bp and mean intron length of 3,456.9 bp (Table 4).
Predicted proteins were firstly searched against protein databases including UniProtKB (SwissProt + TrEMBL) and NR with a very sensitive mode (–very-sensitive -e 1e-5) using Diamond v2.0.11.14938. 22,345 proteins (91.09%) matched the UniProtKB entries. The predicted proteins were then subjected to domain searching via InterProScan 5.48-83.039 against the databases Pfam40, SMART41, Superfamily42, and Conserved Domain Database (CDD)43. Structural domains of 18,021 proteins were identified through InterProScan. In addition, the predicted proteins were analysed for Gene Ontology (GO) and KEGG pathway annotation via eggNOG-mapper v2.1.544 in eggNOG v5.044. The number of GO items and KEGG items was 13,925 and 19,197, respectively (Table 7).
Phylogenetic analyses and GO/KEGG enrichment analyses
Orthologous gene families were obtained using OrthoFinder v2.3.845 from the protein-coding genes from 10 representative species including five Araneae species (A. bruennichi, P. tepidariorum, P. pseudoannulata, S. dumicola, and T. antipodiana), one Xiphosura (Tachypleus gigas), one Scorpiones (Centruroides sculpturatus), and three Acari (I. scapularis, Galendromus occidentalis, and Tetranychus urticae) (Table 9).
The phylogenetic tree of these ten species were constructed with 805 single-cope genes. The genes were firstly aligned using the strategy L-INS-I in MAFFT v7.39446 and trimmed to remove the region of low-homology using ‘-m BLOSUM90 -h 0.4’ in BMGE v1.1247. The resulting alignments were then concatenated using FASconCAT-G v1.0448. The phylogenetic tree was constructed with the concatenated alignments as supermatrix using IQ-TREE v2.0.749 after removing the genes failing the SRH (stationary, reversible, and homogeneous) model using ‘–symtest-remove-bad–symtest-pval 0.10’. Protein substitution model was set as LG, combined with the partitioning algorithm (-m MFP–mset LG–msub nuclear–rclusterf 10). Node support values were assessed using the ultrafast bootstrap and SH-aLRT (Shimodaira-Hasegawa-like approximate likelihood ratio test) algorithms.
The divergence time was estimated using MCMCTree MCMCTree (clock = 2, BDparas = 1 1 0.1, kappa_gamma = 6 2, alpha_gamma = 1 1, rgene_gamma = 2 20 1, sigma2_gamma = 1 10 1) in the package PAML v4.9j50. Fossil records were retrieved from the PaleobioDB (https://www.paleobiodb.org/navigator/) with Chelicerata (genus Paleomerus) 516.0–541.0 million years ago (mya), Arachnida (Acariformes, Pseudoprotacarus scoticus) 407.6-419.2 mya, and Parasitiformes (Deinocroton draculi) 93.5–145.5 mya.
The likelihood of gene family expansion and contraction was identified using CAFÉ v4.2.151 on the bases of the single birth-death parameter λ and a p-value threshold of 0.01. GO and KEGG enrichment of the significantly expanded gene families was analyzed using clusterProfiler v3.10.1 in R52. Among the genomes of the ten selected species, 186,060 genes were classified into 18,287 gene families, including 805 single-copy gene families and 3,428 multi-copy gene families (Table 8). In P. pseudoannulata, 22,591 out of the 24,530 genes were grouped in 10,741 families with 3,873 species-specific genes in 484 gene families (Fig. 3a,Table 9). P. pseudoannulata exceeded the other spider species in terms of the number of species-specific gene families.
Phylogenetic tree was constructed with 734 single-copy genes with 330,721 amino acid sites after IQ-TREE removed 71 single-copy genes. The UFB/SH-aLRT ratios of the branches in the maximum likelihood tree were all 100/100 with the exception for that of Centruroides sculpturatus and Stegodyphus dumicola-Pardosa pseudoannulata being 98.8/94 and 99.5/98, respectively. The number of gene families experienced expansion and contraction was 1,753 and 3,619, respectively, with 265 gene families of significant expansion and 33 of significant contraction (Fig. 3b). Fifteen most significantly expanded gene families with the highest number of genes involved the gene regulation and protein processing (Table 10). Among them, the cuticle protein and cytochrome P450 genes underwent expansion in P. pseudoannulata, which is likely to be in consistent with its adaptation to environment stresses, as also the case in T. antipodiana11. In addition, GO/ KEGG enrichment analyses of the expanded gene families further underlined the importance of cuticle protein genes as we detected the “structural constituent of cuticle” in GO and the “cutin, suberine and wax biosynthesis” in KEGG pathway enrichment (Fig. 3).
Annotation of Hox genes
According to the annotation method of Hox genes in Argiope bruennichi15, the most complete sequences of the ten arthropod Hox gene classes from spiders were chosen as the Hox gene set. A TBLASN search against the genome assembly was performed to retrieve the Hox genes in P. pseudoannulata. The genomic position of best hits (E-value < 1.00 × 10−20 and identity >60%) were compared with the AUGUSTUS gene predictions for those locations. Only the Hox gene with the longest match-length in the same genomic position was retained. In P. psuedoannulata, the Hox genes have been manually annotated, and their genomic positions were retrieved (Fig. 2c). Two clusters of Hox genes, one complete and one incomplete, were found in P. psuedoannulata, in consistence to the results from three other spiders. One of the two clusters (Cluster A) was complete and occurred in a colinear order on chromosome 7. The other cluster (Cluster B) occurred on chromosome 15 with ftz and Hox3 absent. Genes in Cluster B was divided by non-Hox genes into two subculsters as AbdB-AbdA-Ubx-Antp and lab-pd-Dfd in a reversed order on chromosome 15. In contrast to the complete cluster of Hox genes shared in the four spider genomes, the incomplete cluster occurs in a species-specific manner with ftz absent in P. tepidariorum53, Hox3, ftz, Ubx, and AbdA absent in T. antipodiana11, and AbdA and AbdB switching position in A. bruennichi15 (Fig. 2d). The Hox cluster duplication in P. pseudoannulata supports the whole-genome duplication (WGD) predicted to have occurred in the common ancestor of spiders and scorpions11 as evidenced also in three spiders (P. tepidariorum, A. bruennichi and T. antipodiana) and one scorpion (C. sculpturatus).
Annotation and characterization of cuticle protein genes
We further examined the genes involved in the significantly expanded gene family “Insect cuticle protein” and structural domain analysis suggested that most genes encoding cuticle proteins with Rebers and Riddiford (R&R) consensus (CPRs). Genes encoding CPRs were identified following the method previously described in Li et al.19 with the manual verification of the R&R consensus. The protein sequences of all CPRs were aligned using ClustalW in MEGA6 and a phylogenetic tree was constructed with maximum-likelihood method, the bootstrap value of 1000. FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) values of all CPR genes were retrieved from the transcriptomic data obtained from overwintering samples (NCBI accession: PRJNA907545) and nitenpyram-exposure samples (NCBI accession: PRJNA1015725)54. The log2(FPKM value + 1) was subjected to draw the heatmaps illuminating the transcriptional change of CPR genes in overwintering spiderlings and nitenpyram-exposed spiderlings. CPR genes with FPKM values less than 1 were considered not expressed and were excluded from analyses.
We identified totally 243 CPR genes including all of the 152 genes reported previously19. Twenty-four CPRs were categorized as RR-1 type and 219 as RR-2 type. Except for four genes in scaffolds, the majority of the CPR genes were distributed in all 15 chromosomes where many formed tandem arrays or gene clusters (Fig. 5). The CPR genes were unevenly located in chromosomes, with chromosome 4 being the most CPR-abundant (91 CPR genes), followed by chromosome 2 (45), chromosome 5 (43), chromosome 14 (25), and chromosome 10 (17), while one single CPR gene occurred in chromosomes 1, 6, 9, 11, 13, and 15 (Fig. 5a). The clustering of CPR genes in chromosomes may explain their uneven distribution, which has been reported in many arthropods55,56,57. In P. pseudoannulata chromosome 4, seven CPR gene clusters were present with the biggest cluster containing 59 genes in a tandem array. The most of CPRs clustered into two distinct branches, RR-1 subfamily and RR-2 subfamily, in the phylogenetic analysis (Fig. 5b). Four RR-1 members scattering in the RR-2 branch is probably attributed to their highly conserved flanking sequence with RR-2s. The huge number of closely located genes might be the result of gene duplication and the phylogenetic analysis supported it. The majority of the genes in tandem arrays or clusters exhibited high similarity, forming branches in the ML tree (Fig. 5b). For example, the 59 tandemly located gene in chromosome 4 closely clustered together in the ML tree. Nevertheless, some CPR genes were tandemly arrayed in chromosomes but their corresponding proteins scattered in the ML tree, such as the ones from cluster “Chr05-29-2” and “Chr05-29-4”.
As an essential component of cuticle, CPRs play important roles in the adaptation of arthropods to the habitat changes, especially to environmental stresses. P. psuedoannulata wanders in varied ecosystems including farm fields and faces complex environmental challenges, including low temperature in winter and insecticides exposure. Using the transcriptomic data from overwintering spiders and nitenpyram-treated spiders, we investigated the transcriptional responses of CPR genes to the two stresses. The majority of CPR genes were expressed at higher level in Sept. and Oct., and along with the drop in temperature, a large number of CPR genes were down regulated and significantly to ground level in Nov. Dec. and Jan. (Fig. 6a). The downregulation of CPR genes is in consistent with the observation that P. pseudoannulata did not molt during overwintering. However, one CPR gene Papse09523 was upregulated in the course of overwintering though at a medium expression level among all CPRs (Fig. 6b), and the functions of this gene in cold tolerance triggers further investigation. Interestingly, Papse16785 kept down-regulated during overwintering whereas it was significantly up-regulated after acute 0 °C exposure in our previous work (CPR12)19, suggesting that it might contribute to the different mechanisms underlying short-term and long-term coldness tolerance. In the nitenpyram-exposure spiders, ten CPR genes were noticeable for their different transcriptional levels from those in control spiders (Fig. 6c). Three CPR genes (Papse06745, Papse11013 and Papse17376) were significantly up-regulated and they all encoded CPRs belonging to RR-1 subfamily. Seven CPR genes were noticeably down-regulated and six (Papse11014-Papse11024) were located in a gene cluster of 12 genes in chromosome 10. We previously reported several CPR genes were involved in P. pseudoannulata’s responses to different stresses19. Therefore, CPR genes might respond to chemical stresses in a coordinative way. The present study provides more valuable information on the genomic distribution and transcriptional responses to stresses of CPR genes for further functional studies.
Data Records
The RNA-seq data were deposited in the SRA at NCBI for Iso-seq (SRR19534759)58, overwintering (SRR22498904-SRR22498907, SRR22498911, SRR22498913- SRR22498922)59, and nitenpyram-exposure (SRR26044769- SRR26044774)60. The genome assembly and annotation files are available in Figshare (https://doi.org/10.6084/m9.figshare.24190083)61 and GenBank under the accession JAGEOH00000000062.
Technical Validation
We detected possible contaminant sequences via blastn-like searches against the NCBI nucleotide (nt) and UniVec databases with a sequence identity of 0.8 (‘-min-seq-id 0.8’) using MMseqs. 2 v12-113e363. Sequences with over 80% hits were checked via online BLASTN analysis in the NCBI nucleotide database. Sequences with over 90% hits in the databases were considered contaminants and removed from the assembled scaffolds.
The completeness of assembly was evaluated with BUSCO V5.4.464 searching the arthropoda_odb 10 dataset (n = 1,013). The mapping rate was assessed by mapping the clean reads of Illumina and PacBio sequences to the genome assembly using Minimap2. The genome completeness assessed via BUSCO was 96.2% (Table 3).
On the basis of BUSCO analysis, the predicted protein-coding genes were identified 971 (95.8%) complete, 109 (10.8%) duplicated, 22 (2.2%) fragmented, and 20 (2.0%) missing orthologs (Table 4).
Code availability
All data processing commands and pipelines were carried out in accordance with the instructions and guidelines provided by the relevant bioinformatic softwares. There were no custom scripts or code utilized in this study.
References
World Spider Catalog, Version 24. Natural History Museum Bern (2023), online at http://wsc.nmbe.ch, accessed on June 26, 2023.
Sanggaard, K. W. et al. Spider genomes provide insight into composition and evolution of venom and silk. Nat. Commun. 5, 3765 (2014).
Zhang, J., Li, Z., Lai, J., Zhang, Z. & Zhang, F. A novel probe set for the phylogenomics and evolution of RTA spiders. Cladistics 39, 116–128 (2023).
Babb, P. L. et al. Characterization of the genome and silk-gland transcriptomes of Darwin’s bark spider (Caerostris darwini). PLoS One 17, e0268660 (2022).
Babb, P. L. et al. The Nephila clavipes genome highlights the diversity of spider silk genes and their complex expression. Nat. Genet. 49, 895–903 (2017).
Blackledge, T. A. et al. Reconstructing web evolution and spider diversification. Proc. Natl. Acad. Sci. USA 106, 5229–5234 (2009).
Bond, J. E. et al. Phylogenomics resolves a spider backbone phylogeny and rejects a prevailing paradigm for orb web evolution. Curr. Biol. 24, 1765–1771 (2014).
Coddington, J. A., Agnarsson, I., Hamilton, C. A. & Bond, J. E. Spiders did not repeatedly gain, but repeatedly lost, foraging webs. PeerJ 7, e6703 (2019).
Cerca, J. et al. The Tetragnatha kauaiensis genome sheds light on the origins of genomic novelty in spiders. Genome Biol. Evol. 13, evab262 (2021).
Escuer, P. et al. The chromosome-scale assembly of the Canary Islands endemic spider Dysdera silvatica (Arachnida, Araneae) sheds light on the origin and genome structure of chemoreceptor gene families in chelicerates. Mol. Ecol. Resour. 22, 375–390 (2022).
Fan, Z. et al. A chromosome-level genome of the spider Trichonephila antipodiana reveals the genetic basis of its polyphagy and evidence of an ancient whole-genome duplication event. Gigascience 10, giab016 (2021).
Adams, S. A. et al. Reference genome of the long-jawed orb-weaver, Tetragnatha versicolor (Araneae: Tetragnathidae). J. Hered. 114, 395–403 (2023).
Ma, L. et al. Chromosome-level genome assembly of bean flower thrips Megalurothrips usitatus (Thysanoptera: Thripidae). Sci. Data 10, 252 (2023).
Miller, J., Zimin, A. V. & Gordus, A. Chromosome-level genome and the identification of sex chromosomes in Uloborus diversus. GigaScience 12, giad002 (2023).
Sheffer, M. M. et al. Chromosome-level reference genome of the European wasp spider Argiope bruennichi: a resource for studies on range expansion and evolutionary adaptation. Gigascience 10, giaa148 (2021).
Wang, Z. et al. Chromosome-level genome assembly of the black widow spider Latrodectus elegans illuminates composition and evolution of venom and silk proteins. Gigascience 11, giac049 (2022).
Sanchez-Herrero, J. F. et al. The draft genome sequence of the spider Dysdera silvatica (Araneae, Dysderidae): A valuable resource for functional and evolutionary genomic studies in chelicerates. Gigascience 8, giz099 (2019).
Zhong, W., Tan, Z., Wang, B. & Yan, H. Next-generation sequencing analysis of Pardosa pseudoannulata’s diet composition in different habitats. Saudi J. Biol. Sci. 26, 165–172 (2019).
Li, J., Li, F., Gao, H., Zhang, Y. & Liu, Z. Characterization of cuticular proteins in CPR family in the wolf spider, Pardosa pseudoannulata, and the response of one subfamily genes to environmental stresses. Insect Biochem. Mol. Biol. 150, 103859 (2022).
Bushnell, B. BBtools. Available online: https://sourceforge.net/projects/bbmap/ (accessed on 1 October 2022) (2014).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204 (2017).
Vaser, R. & Šikić, M. Time- and memory-efficient genome assembly with Raven. Nat. Comput. Sci. 1, 332–336 (2021).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457 (2020).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. Available online: http://www.repeatmasker.org (accessed on 1 October 2022) (2013–2015).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Chan, P. P. & Lowe, T. M. TRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol. Biol. 1962, 1–14 (2019).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).
Bruna, T., Lomsadze, A. & Borodovsky, M. GeneMark-EP plus: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom. Bioinform. 2, lqaa026 (2020).
Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767–769 (2016).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Finn, R. D. et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199 (2017).
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
Letunic, I., Khedkar, S. & Bork, P. SMART: recent updates, new developments and status in 2020. Nucleic Acids Res. 49, D458–D460 (2021).
Wilson, D. et al. SUPERFAMILY-sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386 (2009).
Marchler-Bauer, A. et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 45, D200–D203 (2017).
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Criscuolo, A. & Gribaldo, S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10, 210 (2010).
Kueck, P. & Longo, G. C. FASconCAT-G: extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. Front. Zool. 11, 81 (2014).
Minh, B. Q. et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Yang, Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Han, M. V., Thomas, G. W. C., Lugo-Martinez, J. & Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol. 30, 1987–1997 (2013).
Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics 16, 284–287 (2012).
Schwager, E. E. et al. The house spider genome reveals an ancient whole-genome duplication during arachnid evolution. BMC Biol. 15, 62 (2017).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
Chen, E. H. et al. Genome-wide annotation of cuticular proteins in the oriental fruit fly (Bactrocera dorsalis), changes during pupariation and expression analysis of CPAP3 protein genes in response to environmental stresses. Insect Biochem. Mol. Biol. 97, 53–70 (2018).
Cornman, R. S. et al. Annotation and analysis of a large cuticular protein family with the R&R Consensus in Anopheles gambiae. BMC Genomics 9, 22 (2008).
Yang, C. H. et al. Identification, expression pattern, and feature analysis of cuticular protein genes in the pine moth Dendrolimus punctatus (Lepidoptera: Lasiocampidae). Insect Biochem. Mol. Biol. 83, 94–106 (2017).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP378269 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP410992 (2022).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP459973 (2023).
Yu, N. & Liu, Z. Pardosa pseudoannulata genome assembly and annotation. figshare https://doi.org/10.6084/m9.figshare.24314020 (2023).
NCBI Assembly https://identifiers.org/ncbi/insdc:JAGEOH000000000 (2023).
Steinegger, M. & Soeding, J. MMseqs. 2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant number: 32172482, 31972296, and 31772185).
Author information
Authors and Affiliations
Contributions
N.Y., J.L. and Z.L. designed the research. J.L., H.B., Y.Z., Z.Y. and F.L. contributed to the sample collection. J.L. and F.L. conducted experimental assays. N.Y., J.L., J.W., J.J. and F.Z. analyzed and visualized data. N.Y. wrote the draft manuscript. N.Y., J.L. and Z.L. revised the manuscript. All co-authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yu, N., Li, J., Bao, H. et al. Chromosome-level genome of spider Pardosa pseudoannulata and cuticle protein genes in environmental stresses. Sci Data 11, 121 (2024). https://doi.org/10.1038/s41597-024-02966-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-02966-1