Background & Summary

The Alashan Ground Squirrel (Spermophilus alashanicus), part of the Rodentia order and Sciuridae family, is a prevalent rodent species native to the Helan Mountains in China1. It thrives in forest grasslands and desert plains, predominantly consuming plants and insects. Characterised by its large, protruding eyes, degenerated outer ears, and hibernating behaviour (Fig. 1), it shares a close phylogenetic relationship with Spermophilus dauricus2. Although assessed for the IUCN Red List of Threatened Species in 20163 (https://doi.org/10.2305/IUCN.UK.2016-3.RLTS.T20478A22265832.en. Accessed on 09 December 2022), research on this species is limited due to its unique distribution, leaving its environmental adaptation mechanisms largely unexplored. Current studies are confined to individual identification4 and habitat suitability analysis5.

Fig. 1

Alashan ground squirrels in the Helan Mountains. (a) and (b) were taken on the eastern slope; (c) and (d) were taken on the western slope.

In the context of global climate change and its ecological repercussions, understanding the molecular mechanisms underlying adaptation to changing environments is crucial. However, until this study, little molecular information had been available for the Alashan Ground Squirrel, particularly in the metagenomic and transcriptomic domains, hindering the understanding of its biological mechanisms. This study introduces extensive metagenomic and transcriptomic datasets derived from high-throughput sequencing of squirrel specimens from the two slopes of the Helan Mountains. Specifically, we collected transcriptomic data from five tissue types (heart, liver, cecum, muscle, and blood) and metagenomic data from faecal contents.

The Helan Mountains range, extending in a rare north-south direction, is a pivotal geographical feature dividing Northwest China6. The west slope, part of the Inner Mongolia Helan Mountains National Nature Reserve, is characterised by a gentle terrain, humid climate, and lush vegetation. Conversely, the east slope, falling under the Ningxia Helan Mountains National Nature Reserve, is noted for its steep incline, dry climate, high temperatures, and sparse vegetation. This dichotomy makes the area an ideal model for understanding how the squirrel responds to environmental changes. In particular, the transcriptomic data can reflect the overall molecular response of different tissues, while the metagenomic data can reveal metabolic and bacterial interactions in animals living in different environments.

Gut microbes play important roles in host health, such as immunity7, nutrient absorption8, and behaviour9,10,11. Different environmental pressures necessitate varying dietary and energy needs for animals within the same species, leading to corresponding changes in their gut microbiota12,13. At present, research on rodents focuses mainly on laboratory animals14, while research on wild rodents is relatively limited. To better understand the functional interplay between gut microbes and their environment, we investigated both the metagenomes and transcriptomes of these squirrels. Our study provides a valuable resource for comprehending the role of gut microbiota in wild rodents.

This study provides the first comprehensive metagenomic and transcriptomic datasets of the Alashan Ground Squirrel. By bridging the knowledge gap in understanding the molecular information of this species, our aim is to provide insights into its adaptation to the environment and contribute to a better understanding of the impact of global climate change on the ecological environment.

Methods

All procedures were carried out in accordance with the legal requirements and regulations of the Animal Experiment Ethics Committee of Northeast Forestry University (NO.20230271). All experimental procedures were approved by the Animal Care and Use Committee of Northeast Forestry University and were performed within the scope of legal requirements and regulations.

Sample collection

The sample collection was led by the government as part of its 2022 programme for the prevention and control of grassland pests (https://www.forestry.gov.cn/main/102/20220126/141650500484904.html). To explore the diversity of the squirrels, we deployed traps near burrows in six alluvial-diluvial fan areas on both the eastern (105.34 E, 38.34 N) and western (105.83 E, 38.78 N) slopes of the Helan Mountains. The traps were placed at 7:00 am, approximately four hours before the squirrels were captured. The live-trapping procedure followed that described for bank voles15,16. We analysed captures from the western slope (n = 10) and the eastern slope (n = 10). Fig. 2 shows the areas where the Alashan Ground Squirrels were captured.

Fig. 2

Helan Mountain capture areas for Alashan Ground Squirrels. The six regions are Helankou, Maliankou, Yushugou, Harau, Fang Jiatian, and South Temple. The area shaded red in the lower right corner marks the Helan Mountains.

For sample collection, we administered 5 mg/kg ethyl acetate (Xilong Scientific, CN) to anaesthetise the animals. Each specimen was assigned a unique identification number, and relevant data including weight, length, and location were recorded (Table 1). Within five minutes of sacrifice, TRIzol reagent (Thermo Fisher Scientific, USA) was added to the tissues after blood collection at a ratio of 2:7. We harvested fresh heart, liver, cecum, and muscle tissues and immediately stored them in RNA extraction solution (Solarbio, CN). The contents of the cecum were collected and placed in an Eppendorf tube. Upon return to the laboratory, all collected samples were stored at −80 °C until DNA and RNA extraction.

Table 1 Information on the individuals of Alashan Ground Squirrels.

Sample preparation and RNA extraction

Approximately 50–100 mg of each tissue was taken and ground to powder in liquid nitrogen. The resulting powder was transferred to a centrifuge tube containing 1 mL of MJzol Reagent (Majorbio, CN) at a ratio greater than 10:1. The sample was thoroughly vortexed and centrifuged at 12,000 rpm for 5 minutes at 4 °C. The supernatant was then transferred to a new tube.

To isolate RNA, chloroform (Thermo Fisher Scientific, USA) was added to the supernatant at a ratio of 200 μL of chloroform per 1 mL of MJzol Reagent. The sample was vortexed for 15 seconds and allowed to stand at room temperature for 3 minutes. It was then centrifuged at 12,000 rpm for 15 minutes at 4 °C, resulting in three distinct layers: a rose-red organic layer at the bottom, a white intermediate layer, and a colourless aqueous layer at the top. The RNA was primarily present in the aqueous phase, which was transferred to a new tube.

Next, 10 μL of magnetic beads (Morck, CN) were added to the aqueous phase. The sample was vortexed for 15 seconds to disperse the beads and then allowed to stand at room temperature for 5 minutes. The tube was then placed on a magnetic stand for 3 minutes, after which the supernatant was discarded. The beads were washed by adding 500 μL of Wash Buffer (Majorbio, CN), vortexing for 15 seconds, and placing the tube on the magnetic stand for 3 minutes. Finally, 45 µL of the RNA solution was transferred to an RNase-Free tube for further analysis.

Total RNA was extracted using TRIzol® Reagent (Solarbio, CN) according to the manufacturer’s instructions (Thermo Fisher Scientific, CN). The purity and integrity of the extracted RNA were assessed with the 2100 Bioanalyser (Agilent, USA), and the concentration was measured using the NanoDrop ND-2000 (Thermo Fisher Scientific, USA). High-quality RNA samples were selected for library construction based on the following criteria: OD260/280 ratio of 1.8–2.2, OD260/230 ratio of ≥2.0, RNA integrity number (RIN) of ≥8.0, 28S:18S ratio of ≥1.0, and total RNA quantity of >1 μg.
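The selection criteria above can be expressed as a simple threshold check. The following is a minimal sketch rather than the pipeline actually used; the class and field names (RnaQc, od260_280, and so on) and the two example samples are hypothetical and serve only to illustrate the thresholds quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class RnaQc:
    """Quality metrics for one RNA sample (field names are illustrative)."""
    od260_280: float
    od260_230: float
    rin: float
    ratio_28s_18s: float
    total_ug: float


def passes_library_criteria(qc: RnaQc) -> bool:
    """Return True if the sample meets every threshold quoted in the text."""
    return (
        1.8 <= qc.od260_280 <= 2.2      # OD260/280 of 1.8-2.2
        and qc.od260_230 >= 2.0         # OD260/230 of at least 2.0
        and qc.rin >= 8.0               # RNA integrity number of at least 8.0
        and qc.ratio_28s_18s >= 1.0     # 28S:18S ratio of at least 1.0
        and qc.total_ug > 1.0           # more than 1 microgram of total RNA
    )


# Hypothetical samples, not taken from the study.
good = RnaQc(od260_280=2.0, od260_230=2.1, rin=8.6, ratio_28s_18s=1.4, total_ug=3.2)
degraded = RnaQc(od260_280=2.0, od260_230=2.1, rin=6.9, ratio_28s_18s=0.8, total_ug=3.2)
print(passes_library_criteria(good), passes_library_criteria(degraded))
```

A sample failing any single criterion is rejected, which matches the way the criteria are listed as joint requirements.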

RNA library construction and sequencing

RNA purification, reverse transcription, library construction, and sequencing were performed at Majorbio Bio-pharm Biotechnology Co., Ltd. (Shanghai, CN) according to the manufacturer’s instructions (Illumina, USA). The Illumina TruSeq™ RNA preparation Kit (Illumina, USA) was used with 1 μg of total RNA to prepare the library. Briefly, poly(A) mRNA was selected using oligo(dT) beads (Invitrogen, USA). Because the Illumina platform sequences short fragments, whereas the enriched mRNA consists of complete transcripts averaging several kb in length, the mRNA was randomly fragmented with 2% fragmentation buffer under conditions chosen to yield fragments of about 300 bp. Using the mRNA as a template, first-strand cDNA was synthesised by reverse transcription, followed by second-strand synthesis, using the SuperScript double-stranded cDNA synthesis kit (Invitrogen, UK) and random hexamer primers (Illumina, USA) to form stable double-stranded cDNA. Following Illumina’s library construction protocol, the sticky ends of the double-stranded cDNA were made blunt with End Repair Mix, and an A base was added to each 3′ end for ligation of the Y-shaped adapter. The adapter-ligated products were purified, and the library was size-selected on a 2% Low Range Ultra Agarose gel to obtain 300 bp cDNA target fragments, followed by 15 cycles of PCR amplification with 2 U/μL Phusion DNA polymerase (NEB) and purification to obtain the final library. Libraries were quantified with the Qubit 4.0 (Thermo Fisher Scientific, USA) and pooled in proportion. Clusters were generated on the cBot by bridge PCR amplification (T100 Thermal Cycler, USA). Finally, the RNA-seq library was sequenced on the Illumina NovaSeq 6000 platform with a 2 × 150 bp read length.

Sequence data processing and transcriptome de novo assembly

The data were analysed on the free online Majorbio Cloud Platform (www.majorbio.com). We listed five types of original data for each sample, along with their original order number and progress order, in Table S1. To ensure the accuracy of downstream analysis, the raw sequencing data were first filtered to obtain high-quality sequencing data (clean data). The specific steps were as follows: 1) removal of adapter sequences from reads and deletion of reads lacking an insert owing to adapter self-ligation or other causes; 2) trimming of low-quality bases (quality score <20) from the 3′ end of each read, after which a read still containing bases with a quality score <10 was deleted and all other reads were retained; 3) removal of reads in which N bases exceeded 10%; 4) exclusion of adapter-containing reads and short reads (read length <20 bp). Trimming and quality control of the raw paired-end reads were performed using fastp v0.19.517 with default parameters. After obtaining high-quality RNA-seq data, we used Trinity v2.8.518 for de novo assembly of the sequencing reads, generating contigs and singletons. Trinity proceeds in three steps. The first step, Inchworm, decomposes the reads into k-mers, builds a k-mer dictionary (k = 25), selects seed k-mers, and extends them to form contigs. The second step, Chrysalis, clusters contigs that share a common origin (for example, alternatively spliced isoforms) into components and builds a de Bruijn graph for each component. The third step, Butterfly, processes each de Bruijn graph and, by tracing the paths supported by the original reads, outputs full-length transcript sequences. The assembly results were assessed and optimised using TransRate v1.0.319. Redundant and similar sequences were removed using CD-HIT v4.5.7. Transcriptome assembly integrity was assessed using BUSCO v3.0.220,21.
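The read-level filtering rules listed above can be illustrated with a short sketch. This is not the fastp implementation; it is a simplified, single-read approximation under stated assumptions: adapter removal is omitted, the "quality score <10" rule is read as "any remaining base below 10", and the function and parameter names (clean_read, trim_q, min_q, max_n_frac, min_len) are hypothetical.

```python
from typing import List, Optional, Tuple


def clean_read(
    seq: str,
    quals: List[int],
    trim_q: int = 20,
    min_q: int = 10,
    max_n_frac: float = 0.10,
    min_len: int = 20,
) -> Optional[Tuple[str, List[int]]]:
    """Apply the described filters to one read; return None if it is discarded."""
    # Trim bases with quality below trim_q from the 3' end.
    end = len(seq)
    while end > 0 and quals[end - 1] < trim_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # Discard reads that are empty or shorter than the minimum length.
    if not seq or len(seq) < min_len:
        return None
    # Discard reads that still contain a very low-quality base
    # (one possible reading of the "quality score <10" rule).
    if min(quals) < min_q:
        return None
    # Discard reads whose proportion of N bases exceeds the limit.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return None
    return seq, quals


# Hypothetical reads: the first loses two low-quality 3' bases and is kept,
# the second is too short, and the third contains too many N bases.
kept = clean_read("ACGT" * 6 + "AC", [30] * 24 + [5, 5])
too_short = clean_read("ACGT" * 3, [30] * 12)
n_rich = clean_read("N" * 5 + "ACGT" * 5, [30] * 25)
print(kept is not None, too_short, n_rich)
```

In practice these steps are applied to both mates of a paired-end read, and a pair is usually dropped if either mate fails; that behaviour is left out here for brevity.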

The assembled transcripts were searched against several databases, including the NCBI protein non-redundant (NR) database, a manually annotated and reviewed protein sequence database (Swiss-Prot)22, Gene Ontology (GO)23, Pfam24, and Kyoto Encyclopedia of Genes and Genomes (KEGG)25. For searches against NR, Clusters of Orthologous Genes (COG), and Swiss-Prot, DIAMOND v0.8.37.99 was used with a cut-off e-value of 1e-5. Blast2GO v2.9.026 was used to obtain GO annotations for the unique assembled transcripts, describing biological processes, cellular components, and molecular functions. KOBAS v3.027, with a cut-off e-value of 1e-5, was employed in the KEGG pathway analysis. Additionally, HMMER v3.2.128 was used for Pfam with a cut-off e-value of 1e-5. Owing to the absence of a reference genome for the Alashan Ground Squirrel, we executed a de novo transcriptome assembly pipeline. A schematic representation of the complete workflow is provided in Fig. 3.

Fig. 3

Complete workflow for transcriptomes and metagenomics.

Differential expression analysis and functional enrichment analysis

To identify differentially expressed genes (DEGs) between groups, we quantified the expression level of each gene using the transcripts per million (TPM) method. We used RSEM v1.3.129 to estimate gene abundances, and differential expression analysis between groups was performed using DESeq2 v1.24.030. Genes with |log2 (foldchange)| ≥ 1 and a p-adjust value ≤ 0.05 were considered DEGs. We then conducted functional enrichment analysis of the DEGs against the GO23 and KEGG31 databases using Goatools v0.6.532 and a custom script developed by Majorbio (Shanghai, CN), respectively. A p-adjust < 0.05 was considered statistically significant. Enrichment analysis of the GO and KEGG databases is shown in Fig. 4.
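To make the TPM normalisation and the DEG thresholds concrete, the following minimal sketch shows both calculations. It is not the RSEM or DESeq2 implementation: RSEM estimates effective lengths and expected counts with a statistical model, whereas this example assumes fixed lengths and uses hypothetical gene names and values.

```python
from typing import Dict, List, Tuple


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Convert read counts to transcripts per million using the given lengths."""
    # Length-normalise each gene, then scale so that all values sum to one million.
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def is_deg(log2_fold_change: float, p_adjust: float) -> bool:
    """Apply the thresholds from the text: |log2FC| >= 1 and adjusted p <= 0.05."""
    return abs(log2_fold_change) >= 1.0 and p_adjust <= 0.05


# Hypothetical counts and lengths for two genes.
values = tpm({"geneA": 100, "geneB": 300}, {"geneA": 1000, "geneB": 3000})

# Hypothetical differential expression results: (gene, log2FC, adjusted p-value).
results: List[Tuple[str, float, float]] = [
    ("geneA", 1.7, 0.01),
    ("geneB", -0.4, 0.001),
    ("geneC", -2.3, 0.2),
]
degs = [gene for gene, lfc, padj in results if is_deg(lfc, padj)]
print(values, degs)
```

Both conditions must hold for a gene to count as a DEG, so a large fold change with a non-significant adjusted p-value (geneC here) is excluded.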

Fig. 4

Functional enrichment analysis of the eastern and western slopes, (a) GO enrichment and (b) KEGG enrichment.

Metagenomic DNA extraction and library preparation

Metagenomic DNA extraction was performed using the E.Z.N.A.® Soil DNA Kit (Omega Bio-tek, US) following the manufacturer’s instructions. The procedure involved adding 500 mg of magnetic beads and 0.5 g of SLX-Mlus Buffer to 2 mL of finely ground tissue in a tube, followed by vibration at 45 Hz for 250 seconds. Then, 100 μL of DS Buffer was added and mixed. The sample was incubated at 70 °C for 10 minutes and then at 95 °C for 2 minutes. After centrifugation at 13,000 rpm at room temperature for 5 minutes, 800 μL of the supernatant was transferred to a fresh 2 mL tube, to which 270 μL of P2 Buffer and 100 μL of HTR Reagent were added. This was followed by incubation at −20 °C for 5 minutes and then centrifugation at 13,000 rpm for another 5 minutes. The supernatant was then transferred to a fresh tube, and an equal volume of XP5 Buffer and 40 μL of magnetic beads were added. After mixing, the magnetic beads were used to adsorb the DNA and the residual liquid was removed. The tube was washed sequentially with 500 μL and then 600 μL of XP5 Buffer, followed by 600 μL of PHB. Finally, the tube was washed twice with 600 μL of SPW Wash Buffer. After the final centrifugation at 13,000 rpm for 10 seconds, 100 μL of Elution Buffer was added, mixed, and left at room temperature for 5 minutes. The DNA was then separated from the magnetic beads using magnetic force and transferred to a 1.5 mL tube.

The concentration and purity of the extracted DNA were measured using TBS-380 and NanoDrop2000, respectively. The DNA quality was assessed by running it on a 1% agarose gel at a voltage of 5 V/cm for 20 minutes. For library construction, the DNA was fragmented to an average size of approximately 400 bp using the Covaris M220 (Gene Company Limited, CN). The NEXTFLEX® Rapid DNA-seq kit (Bioo Scientific, USA) was used for the library construction. Adapters containing sequencing primer hybridisation sites were ligated to the blunt ends of the fragments. This process included adapter ligation, magnetic bead screening to remove self-ligated adapter fragments, enrichment of library templates through PCR amplification, and magnetic bead recovery of PCR products to obtain the final library.

Bridge PCR and sequencing

Metagenomic sequencing was conducted on the Illumina NovaSeq 6000 platform at Majorbio Bio-pharm Biotechnology Co., Ltd. (Shanghai, CN) according to the manufacturer’s instructions (Illumina, USA). In bridge PCR, one end of each library molecule hybridises to a complementary primer on the chip and, after one round of amplification, the template information is fixed on the chip. The other end of the chip-bound molecule then randomly hybridises to another nearby primer, forming a “bridge”. Repeated PCR amplification produces DNA clusters, and the DNA amplicons are then linearised into single strands. A modified DNA polymerase and dNTPs carrying four different fluorescent labels are added so that only one base is incorporated in each cycle. A laser scans the surface of the reaction plate to read the nucleotide incorporated in the first cycle for each template sequence. The “fluorophore” and “termination group” are then chemically cleaved to restore a reactive 3′ end, enabling incorporation of the second nucleotide. The sequence of the template DNA fragment is determined from the fluorescence signals collected in each cycle.

Sequence quality control and metagenome assembly

Adaptor sequences were removed, and low-quality reads (length <50 bp, quality value <20, or containing N bases) were filtered out using fastp v0.23.016. Metagenomic sequencing data were assembled with MEGAHIT v1.1.233, which uses succinct de Bruijn graphs to resolve branching issues arising from strain differences. Contigs with a minimum length of 300 bp were kept as the final assembly, which was then used for gene prediction and annotation.

Gene prediction, taxonomy

Open reading frames (ORFs) were predicted from each assembled contig using Prodigal v2.6.334. The predicted ORFs, with a minimum length of 100 bp, were translated into amino acid sequences as potential indicators of protein-coding genes. A non-redundant gene catalogue was constructed using CD-HIT v4.6.120, with a threshold of 90% sequence identity and 90% coverage. Clustering was performed based on the predicted coding fragments in the metagenomic sequencing assembly data. The longest gene in each cluster was selected as the representative sequence, reducing redundancy, and yielding the predicted gene set. High-quality reads were aligned to the non-redundant gene catalogues to calculate gene abundance, with a 95% identity threshold using SOAPaligner v2.2135.
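The minimum-length filter and the choice of the longest gene per cluster can be sketched as follows. This is not how CD-HIT computes clusters (it performs greedy incremental clustering on sequence identity); the sketch assumes cluster assignments are already available, and all identifiers (pick_representatives, the orf and cluster labels) are hypothetical.

```python
from typing import Dict


def pick_representatives(
    gene_lengths: Dict[str, int],
    gene_cluster: Dict[str, str],
    min_len: int = 100,
) -> Dict[str, str]:
    """Map each cluster ID to its longest member gene of at least min_len bp."""
    best: Dict[str, str] = {}
    for gene, cluster in gene_cluster.items():
        # Skip predicted ORFs below the minimum length used in the text.
        if gene_lengths[gene] < min_len:
            continue
        current = best.get(cluster)
        # Keep the longest gene seen so far for this cluster.
        if current is None or gene_lengths[gene] > gene_lengths[current]:
            best[cluster] = gene
    return best


# Hypothetical ORF lengths (bp) and precomputed cluster assignments.
lengths = {"orf1": 450, "orf2": 900, "orf3": 120, "orf4": 90}
clusters = {"orf1": "c1", "orf2": "c1", "orf3": "c2", "orf4": "c2"}
representatives = pick_representatives(lengths, clusters)
print(representatives)
```

Read abundance would then be computed only against these representative sequences, which is what keeps the gene catalogue non-redundant.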

Functional annotation and quality control of annotation

Representative sequences from the non-redundant gene catalogue were aligned to the KEGG25 and COG36,37 databases using DIAMOND v0.8.3538 with an e-value cut-off of 1e-5 for functional annotation. In KEGG functional annotation, the abundance of each functional category was calculated by summing the abundances of the genes assigned to the corresponding KO, Pathway, EC, or Module. The amino acid sequences of the non-redundant gene set were compared against the Carbohydrate-Active enZYmes (CAZy)39 database using hmmscan with an e-value cut-off of 1e-5 to obtain carbohydrate-active enzyme annotations. The abundance of each carbohydrate-active enzyme was then calculated as the sum of the abundances of the genes assigned to it. The dominant phyla identified were Firmicutes, Bacteroidetes, Verrucomicrobia, Uroviricota, and Proteobacteria. An overview of the KEGG annotations is shown in Fig. 5.
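The summation of gene abundances into functional categories can be illustrated with a small sketch. The function name, gene identifiers, and category labels below are hypothetical placeholders rather than real KEGG or CAZy entries, and a gene assigned to several categories is assumed to contribute its full abundance to each.

```python
from collections import defaultdict
from typing import Dict, Iterable


def sum_by_category(
    gene_abundance: Dict[str, float],
    gene_to_categories: Dict[str, Iterable[str]],
) -> Dict[str, float]:
    """Sum gene abundances for every functional category a gene is assigned to."""
    totals: Dict[str, float] = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        # Genes without an annotation contribute to no category.
        for category in gene_to_categories.get(gene, ()):
            totals[category] += abundance
    return dict(totals)


# Hypothetical abundances and annotations; "KO_A" and "KO_B" are placeholders.
abundance = {"g1": 10.0, "g2": 5.0, "g3": 2.5}
annotation = {"g1": ["KO_A"], "g2": ["KO_A", "KO_B"]}
category_abundance = sum_by_category(abundance, annotation)
print(category_abundance)
```

The same function applies unchanged to Pathway, EC, Module, or CAZy family assignments, since only the gene-to-category mapping differs.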

Fig. 5 

KEGG functional annotation of metagenomics.

Data Records

In this study, 20 Alashan Ground Squirrel individuals were used to produce 120 files, comprising RNA-seq samples from different tissues and metagenomic samples. Specific details for each sample are provided in Tables 1 and 2. Raw RNA-seq data were deposited in the NCBI BioProject40 https://identifiers.org/ncbi/bioproject:PRJNA935915. Raw metagenome data and corresponding assemblies were deposited in the NCBI BioProject41 https://identifiers.org/ncbi/bioproject:PRJNA932588.

Table 2 Summary of sample data information deposited in the SRA database.

Technical Validation

Quality of the raw reads and assembly validation

Over 700 million raw paired-end reads were obtained from 20 biological samples of Alashan Ground Squirrel. Subsequent trimming and filtering retained approximately 580 million high-quality paired-end reads for de novo assembly. Transcriptome sequencing data for the five tissues are detailed in Table S1. The initial Trinity assembly produced 365,309 unigenes with an N50 of 4,992 bp. After optimisation and filtering of the initial assembly, a total of 72,156 unigenes remained, with an N50 of 6,703 bp and a GC content of 47.51% (Table 3). The BUSCO completeness score of the final assembled transcriptome was 98.4%. The length distribution of the assembled sequences and the functional annotation statistics are depicted in Fig. 6. The clean reads from each sample were mapped to the reference transcriptome generated by the Trinity assembly, and the mapping statistics are reported in Table S2. This mapping forms the foundation for subsequent gene and transcript quantification for each sample.
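For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows how it is commonly computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the cumulative length first reaches half of the total. The function name and the example lengths are illustrative and do not come from the assembly described here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("N50 is undefined for an empty set of sequences")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half of the
        # assembly size is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical unigene lengths in bp (total 1,500, so the halfway point is 750).
example_n50 = n50([100, 200, 300, 400, 500])
print(example_n50)
```

Because removing many short, redundant sequences shifts the halfway point towards longer sequences, a filtered assembly can show a higher N50 than the initial one, as seen in the figures quoted above.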

Table 3 Evaluation of transcriptome assembly in Alashan Ground Squirrels.
Fig. 6

Sequence length distribution of unigenes and evaluation of functional annotation in different databases. (a) Sequence length distribution. (b) Comparison of all genes and transcripts obtained from transcriptome assembly with five major databases.

For metagenomic analysis, fastp v0.23.017 was used for data quality control, removing low-quality and N-containing reads from the original sequencing data. This process yielded high-quality sequences for further analysis, as shown in Table 4. The assembly with the best result was selected for ORF prediction. Genes with a nucleic acid length of 100 bp or greater were selected and translated into amino acid sequences; the corresponding statistics are presented in Tables 5 and 6.

Table 4 Clean reads statistics obtained from western and eastern slopes.
Table 5 Metagenome assembly statistics for each individual.
Table 6 Gene prediction statistics for each individual. ORF = Open Reading Frame.

Quality control of annotation

The transcriptome was functionally annotated using DIAMOND38, KOBAS27, and Blast2GO26. We compared all unigenes and expressed unigenes obtained from transcriptome assembly with major databases (NR, Swiss-Prot, Pfam, GO, and KEGG) to comprehensively gather functional information about the unigenes. The annotations from each database are presented in Table 7.

Table 7 Transcriptome annotation.

For metagenomics, functional annotation was performed using DIAMOND v2.0.1338. We obtained species and abundance information for each taxonomic level in each sample. Comparison with the CAZy database provided functional annotation information for carbohydrate-active enzyme genes, which were then statistically analysed. Functional annotations of COG and CAZy are included in Tables S3 and S4.

Table 8 Software for transcriptome analysis.
Table 9 Software for metagenomic analysis.