Background & Summary

The Siberian chipmunk, Tamias sibiricus (Laxmann, 1769), belongs to the subfamily Xerinae within the family Sciuridae of the order Rodentia1. This species is a small, diurnal, ground-dwelling squirrel that lives in mountain and forest habitats with a bushy understory2. Wild populations of T. sibiricus are naturally distributed in Russia and several East Asian countries (China, Mongolia, Korea and Japan). This squirrel is also one of the most popular companion animals because of its attractive appearance and unique behavior3. It has therefore been introduced into European countries as a pet for decades, and accidentally escaped individuals have successfully established populations in the wild4. Additionally, as an important seed dispersal agent that relies primarily on scatter- and larder-hoarding, T. sibiricus provides essential seed dispersal services in many ecosystems across the world5. Over the past decades, studies of T. sibiricus have mainly focused on its biology, behavior, ecology and phylogeography5,6,7,8. However, little is known about the genetic basis and mechanisms of its environmental adaptation because of limited molecular information.

In the present study, we constructed a high-quality genome assembly for the Siberian chipmunk by integrating short reads (Illumina sequencing), long reads (PacBio sequencing) and Hi-C reads (proximity ligation chromatin conformation capture). The final assembled genome of T. sibiricus was 2.64 Gb in size, with a scaffold N50 length of 172.61 Mb. A total of 2.59 Gb of assembled sequence was successfully anchored on 19 chromosomes. This number of chromosomes is consistent with the results of karyotype analysis9. We identified 1.03 Gb of repetitive sequences, constituting 38.87% of this reference genome. A total of 25,311 protein-coding genes were predicted, and 97.69% of these genes were functionally annotated.

Methods

Sample collection and ethics statement

An adult female specimen of T. sibiricus was collected from a forestry farm in Chifeng, Inner Mongolia Autonomous Region of China (41°39′N, 118°22′E) in October 2020. The sample was then maintained at Qufu Normal University and stored at −80°C prior to DNA and RNA extraction. All experiments were performed according to the Guidelines for the Care and Use of Laboratory Animals in China. The sampling of the squirrel used in this study was approved by the Institutional Animal Care and Use Committee (IACUC) of Qufu Normal University, Shandong, China, under permit No. 2021095.

Sequencing

Muscle tissue from the female specimen was prepared for transcriptome, Illumina and PacBio whole-genome, and Hi-C sequencing. All sequencing was performed by Shanghai Origingene Bio-pharm Technology Co. Ltd. (Shanghai, China). Genomic DNA was extracted using a Blood & Cell Culture DNA Mini Kit (Qiagen, Germany). The quality and quantity of the total DNA were determined with a 2100 Bioanalyzer (Agilent, USA) and a Qubit 3.0 Fluorometer (Invitrogen, USA), respectively. Total RNA was isolated using a TRIzol Total RNA Isolation Kit (Takara, USA) following the manufacturer’s protocols10. A NanoDrop 2000 spectrophotometer (Labtech, USA) and the 2100 Bioanalyzer were used to check RNA quality.

Whole-genome shotgun sequencing was performed with a single-molecule real-time (SMRT) PacBio system. PacBio Sequel II libraries with an insert size of 30 kb were prepared using a SMRTbell Template Prep Kit 2.0. For the genome survey and for correcting errors associated with long reads, two short paired-end libraries with an insert size of 350 bp were constructed using a TruSeq DNA PCR-free Kit (Illumina, USA). The short-read data were generated on the Illumina HiSeq X10 platform. To construct pseudo-chromosomes, the Hi-C library was prepared according to the standard protocols described previously11. After quality control, 150 bp paired-end reads (PE150) were obtained using the Illumina HiSeq X10 platform. The cDNA library was constructed using a TruSeq RNA Sample Prep Kit v2 (Illumina, USA) and sequenced on the Illumina HiSeq X10 system using the paired-end strategy.

Genome survey and assembly

A total of 132.39 Gb of Illumina short-insert data was first generated to gain a preliminary understanding of the genome characteristics (Table 1). Based on the clean data with duplicates removed, the K-mer frequency distribution was calculated with Jellyfish v2.2.612 and the results were subsequently analyzed with GenomeScope v2.013. The genome size of T. sibiricus was estimated to be 2.51 Gb, with the number of unique K-mers peaking at 21 (Fig. 1). Evaluation of genome characteristics showed that the heterozygosity rate of the genome was 0.21% (Table S1).
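The principle behind a K-mer-based size estimate can be illustrated with a minimal sketch. It assumes a two-column histogram (depth, number of distinct K-mers at that depth), such as the one written by `jellyfish histo`, and uses the simple estimate of total K-mer occurrences divided by the depth of the main peak. GenomeScope fits a more elaborate mixture model, so this is not the procedure used in this study. The function name, the low-depth error cutoff and the toy histogram are illustrative assumptions.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total K-mer occurrences) / (main peak depth).

    histogram: list of (depth, count) pairs, where count is the number of
    distinct K-mers observed at that depth.
    min_depth: bins below this depth are treated as sequencing errors
    (an assumed cutoff, not a value from the study).
    """
    usable = [(d, c) for d, c in histogram if d >= min_depth]
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")
    # The main peak is the depth with the largest number of distinct K-mers.
    peak_depth = max(usable, key=lambda dc: dc[1])[0]
    # Total K-mer occurrences = sum over bins of depth * count.
    total_kmers = sum(d * c for d, c in usable)
    return total_kmers / peak_depth, peak_depth


# Toy histogram with hypothetical numbers (not the T. sibiricus data).
toy = [(1, 900), (2, 300), (19, 40), (20, 80), (21, 100), (22, 80), (23, 40)]
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.1f}")
```

The low-depth bins are discarded because erroneous K-mers accumulate there and would otherwise inflate the estimate.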

Table 1 Statistics of the DNA sequence data used for genome assembly.
Fig. 1 K-mer analysis of the Tamias sibiricus genome.

For PacBio sequencing, approximately 111.63 Gb of long reads were obtained after removing adaptors from polymerase reads with default parameters. The mean length and N50 length of the PacBio subreads were 35.62 and 24.13 kb, respectively (Table 1). After long-read self-correction and polishing, the initial genome assembly was performed using Canu v1.814. As a result, we generated a 2.65 Gb genome assembly with a contig N50 of 9.40 Mb (Table 2). To further improve the quality and accuracy of the assembly, we polished it with high-coverage Illumina short reads using Pilon v1.2315. The resulting draft genome assembly had a total size of 2.64 Gb with an N50 length of 9.43 Mb.

For the chromosome-level assembly, 217.38 Gb of Hi-C sequencing data was generated and used to anchor contigs into pseudo-chromosomes (Table 1). The 3D-DNA v180922 pipeline was used to generate the chromosome-level assembly16. After duplicates were removed, the Hi-C contact map was taken directly as input for 3D-DNA; the location and orientation of each contig were determined, and neighboring contigs were joined with gaps of 100 Ns. Juicebox v1.11.08 (Juicebox Assembly Tools, JBAT) was subsequently used to review and manually curate scaffolding errors17. The final size of the genome was 2.64 Gb with a scaffold N50 of 172.61 Mb (Table 2), close to the size estimated from the genome survey analysis. At the base level, 2.59 Gb of sequence was anchored and oriented onto 19 chromosomes, an anchoring rate of 98.03%, and the chromosome lengths ranged from 28.70 to 222.90 Mb (Table 3 and Fig. 2). After scaffolds were clustered, ordered and oriented to restore their relative locations, the Hi-C interaction heat map indicated that the genome assembly was complete and robust (Fig. 2).
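Because N50 and the anchoring rate are the main contiguity metrics reported here, a minimal sketch of how they are conventionally computed may be helpful. The function names and the toy contig lengths are illustrative assumptions and are not taken from the assembly pipeline.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def anchoring_rate(anchored_bp, total_bp):
    """Percentage of assembled bases placed on chromosomes."""
    return 100.0 * anchored_bp / total_bp


# Toy contig lengths (hypothetical values).
contigs = [50, 40, 30, 20, 10]
print(n50(contigs))
print(round(anchoring_rate(98, 100), 2))
```

Applied to the rounded figures of 2.59 Gb anchored out of 2.64 Gb, `anchoring_rate` gives roughly 98.1%; the reported 98.03% was presumably computed from unrounded base counts.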

Table 2 Summary of each step in construction of the T. sibiricus genome assembly.
Table 3 Statistics of chromosomal level assembly of T. sibiricus.
Fig. 2 Heat map of the Hi-C assembly of Tamias sibiricus. The color bar shows contact density from red (high) to white (low).

Chromosome synteny

Collinearity analysis of chromosomes between T. sibiricus and two other Sciuridae species (Sciurus vulgaris and Sciurus carolinensis) was conducted with LASTZ v1.02.0018. As shown in Fig. 3, all 19 pseudochromosomes of T. sibiricus displayed high homology with the corresponding chromosomes of the other two squirrels, and two chromosomes (chr11 and chr15 of S. vulgaris; chr11 and chr14 of S. carolinensis) corresponded to a single fused chromosome (chr11) in the Siberian chipmunk. Previous studies using cross-species chromosome painting showed that the diploid chromosome number varies among species in the superorder Glires (Rodentia and Lagomorpha)19,20, with the Siberian chipmunk having 38 chromosomes9. Interestingly, the variation seems to follow a pattern, with chromosome numbers such as 32, 34, 36, 38 and 40. Combined with our synteny results, this suggests that chromosome fusions and fissions might have occurred frequently during the genome evolution of Glires. Further studies with more chromosome-level genomic data are therefore needed to determine the molecular mechanisms of chromosomal rearrangement and evolution.

Fig. 3 Genomic synteny between Tamias sibiricus and two other Sciuridae species (Sciurus vulgaris and Sciurus carolinensis).

Repeat annotation

After genome assembly, three types of genomic features were annotated: repetitive sequences, non-coding RNAs (ncRNAs) and protein-coding genes (PCGs). RepeatModeler v2.0.1 was used with default parameters to identify repetitive elements, and a de novo repeat library was built from the results21. A custom library was then constructed by combining this library with the Dfam 3.122 and RepBase 20181026 databases23. For the homology-based prediction, repetitive elements were masked using RepeatMasker v4.1.0 with the custom library24. A total of 1.03 Gb of repetitive sequences was identified, constituting 38.87% of the T. sibiricus genome. The four predominant categories of transposable elements (TEs) were long interspersed nuclear elements (LINEs, 18.63%), long terminal repeats (LTRs, 10.11%), short interspersed nuclear elements (SINEs, 8.90%) and DNA transposons (2.71%) (Table 4 and Fig. 4). Non-coding RNAs were annotated using Infernal v1.1.325 (rRNAs, snRNAs and miRNAs) and tRNAscan-SE v2.0.726 (tRNAs). Only high-confidence tRNAs were retained, using the tRNAscan-SE script ‘EukHighConfidenceFilter’. In total, the annotation yielded 6,265 tRNAs, 830 small nuclear RNAs (snRNAs), 92 ribosomal RNAs (rRNAs) and 595 microRNAs (miRNAs) (Table S2).
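The repeat proportion is the number of masked bases divided by the total number of bases. The sketch below illustrates that calculation under the assumption that the genome has been soft-masked, with repeats written in lowercase (for example via RepeatMasker's `-xsmall` option); the text does not state which masking mode was used, and the function name and toy sequence are hypothetical.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip FASTA headers and blank lines
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence found")
    return masked / total


# Toy soft-masked FASTA (hypothetical sequence, 6 of 20 bases masked).
toy_fasta = [">chr_demo", "ACGTacgtAC", "ggACGTACGT"]
print(f"{100 * masked_fraction(toy_fasta):.2f}% masked")
```

For a real genome the lines would be streamed from the masked FASTA file rather than held in a list.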

Table 4 Repeat annotation in the T. sibiricus genome.
Fig. 4 Genome characteristics of Tamias sibiricus. From the outer ring to the inner ring are the distributions of RNA TEs, DNA TEs, gene density and GC content.

Protein-coding gene annotation

The MAKER v3.01.03 pipeline was used to predict protein-coding genes by integrating three strategies: ab initio prediction, transcriptome-based annotation and homology-based annotation27. The ab initio prediction was generated using the BRAKER v2.1.528 pipeline, which automatically trained the predictors Augustus v3.3.429 and GeneMark-ET30 and made use of the mapped transcriptome data and protein homology information. The transcriptome alignments in BAM format were produced by HISAT2 v2.2.031, and the protein sequences were extracted from the OrthoDB10 v132 database. For transcriptome-based annotation, the RNA-seq data were first mapped to our assembly with HISAT2, and the aligned reads were then assembled into transcripts using StringTie v2.1.433 with our assembly as the reference. For homology-based prediction, protein sequences of five model rodent species (Cricetulus griseus, Dipodomys ordii, Ictidomys tridecemlineatus, Marmota marmota and Rattus norvegicus) were downloaded from the NCBI RefSeq database and supplied to MAKER as reference evidence. Overall, 25,311 protein-coding genes were predicted, with an average gene length of 32,936 bp. The average exon number per gene was 7.52, with an average exon length of 171.85 bp and an average intron length of 4,850.84 bp. The final gene models were then functionally annotated using the NCBI non-redundant (NR) protein database, Swissprot, Pfam, the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases. In total, 23,995 (94.73%) genes had at least one homologous hit in these five public databases. Based on BUSCO analysis, 94.4% of the genes in the BUSCO database (mammalia_odb10) were identified (complete single-copy genes: 92.2%, fragmented genes: 1.5%), further underlining the accuracy and completeness of the gene prediction.
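Summary statistics such as exons per gene and mean exon length can be derived directly from a GFF3 annotation. The following is a minimal sketch rather than the procedure used in this study: it assumes standard nine-column GFF3 with `exon` features carrying a `Parent` attribute, and it counts exons per parent transcript, which equals exons per gene only when each gene has a single transcript. The function name and toy records are hypothetical.

```python
from collections import defaultdict


def exon_stats(gff3_lines):
    """Return (mean exons per parent transcript, mean exon length in bp)."""
    exons_per_parent = defaultdict(int)
    exon_lengths = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        start, end = int(cols[3]), int(cols[4])
        exon_lengths.append(end - start + 1)  # GFF3 is 1-based, inclusive
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        exons_per_parent[attrs.get("Parent", "unknown")] += 1
    if not exon_lengths:
        raise ValueError("no exon features found")
    mean_exons = len(exon_lengths) / len(exons_per_parent)
    mean_length = sum(exon_lengths) / len(exon_lengths)
    return mean_exons, mean_length


# Toy annotation (hypothetical records).
toy_gff3 = [
    "##gff-version 3",
    "chr1\tdemo\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tdemo\texon\t201\t350\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tdemo\texon\t1001\t1200\t.\t-\t.\tID=e3;Parent=t2",
]
print(exon_stats(toy_gff3))
```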

Gene family

OrthoFinder v2.3.8 was used to infer gene families (orthologous groups, or orthogroups) with Diamond as the sequence aligner34. The protein sequences of the T. sibiricus genome and high-quality protein annotations from the assembled genomes of 19 other species were used for the analysis, including the naked mole‐rat (Heterocephalus glaber), Eurasian squirrel (S. vulgaris), eastern grey squirrel (S. carolinensis), alpine marmot (M. marmota), thirteen‐lined ground squirrel (I. tridecemlineatus), Arctic ground squirrel (Urocitellus parryii), Daurian ground squirrel (Spermophilus dauricus), Iberian mole (Talpa occidentalis), Ord’s kangaroo rat (D. ordii), European blind mole (Nannospalax galili), white-footed mouse (Peromyscus leucopus), deer mouse (Peromyscus maniculatus), southern grasshopper mouse (Onychomys torridus), prairie vole (Microtus ochrogaster), Chinese hamster (Cricetulus griseus), golden hamster (Mesocricetus auratus), Norway rat (R. norvegicus), mouse (Mus musculus) and degu (Octodon degus). In total, 20,952 gene families were identified among the 20 species, and 433,351 genes were assigned to these orthogroups by OrthoFinder (Table S3). The analysis also identified 5,277 single-copy orthologs. Of the 25,311 genes of T. sibiricus, 18,863 were clustered into 15,629 orthogroups, and 148 gene families comprising 502 genes were unique to T. sibiricus. The number of genes assigned to different orthologous groups is shown in Fig. S1 and Table S4.
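The definitions behind "single-copy orthologs" and "species-specific gene families" can be made concrete with a short sketch. It assumes a per-orthogroup table of gene counts per species (similar in spirit to the gene count table that OrthoFinder writes) already loaded into a dictionary; the function name, species labels and counts are hypothetical.

```python
def summarize_orthogroups(gene_counts, focal):
    """gene_counts: {orthogroup: {species: number_of_genes}}.

    Returns (number of single-copy orthogroups,
             number of orthogroups present only in the focal species).
    """
    species = set()
    for counts in gene_counts.values():
        species.update(counts)
    single_copy = 0
    focal_specific = 0
    for counts in gene_counts.values():
        # Single-copy: exactly one gene in every species.
        if all(counts.get(sp, 0) == 1 for sp in species):
            single_copy += 1
        # Focal-specific: genes present in the focal species and nowhere else.
        present = {sp for sp, n in counts.items() if n > 0}
        if present == {focal}:
            focal_specific += 1
    return single_copy, focal_specific


# Toy table with three hypothetical species A, B and C.
toy = {
    "OG1": {"A": 1, "B": 1, "C": 1},
    "OG2": {"A": 2, "B": 1, "C": 0},
    "OG3": {"A": 3, "B": 0, "C": 0},
    "OG4": {"A": 1, "B": 1, "C": 1},
}
print(summarize_orthogroups(toy, focal="A"))
```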

Data Records

The genomic Illumina sequencing data was deposited in the NCBI Sequence Read Archive (SRA) database35 under accession No. SRR19929230.

The genomic PacBio sequencing data was deposited in the SRA database36 under accession No. SRR19961223.

The transcriptome Illumina sequencing data was deposited in the SRA database37 under accession No. SRR19961278.

The Hi-C sequencing data was deposited in the SRA database38 under accession No. SRR19960530.

The assembled genome was deposited in GenBank at NCBI39 under accession No. GCA_025594165.1.

Genome annotation information of repeated sequences, gene structure and functional prediction is available in the Figshare database40.

Technical Validation

The completeness and accuracy of the assembled genome were evaluated using two different strategies. First, BUSCO analysis (BUSCO v4.0.5) revealed that 92.9% (single-copy genes: 92.2%, duplicated genes: 0.7%) of the 9,226 single-copy orthologues in the mammalia_odb10 database were identified as complete, 1.5% were fragmented and 5.6% were missing in the assembly. Second, we mapped the sequencing data to the assembled genome to verify its accuracy. The mapping rates were 97.42%, 98.00% and 96.03% for the Illumina, RNA-seq and PacBio data, respectively.
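The BUSCO categories are related by simple arithmetic: complete equals single-copy plus duplicated, and complete, fragmented and missing sum to 100%. The sketch below checks this for a one-line summary in the style of BUSCO's short summary output. The summary string is reconstructed from the percentages reported above rather than copied from an output file, and the function names and rounding tolerance are illustrative assumptions.

```python
import re


def parse_busco(summary):
    """Parse a BUSCO-style one-line summary into a dict of numbers."""
    pattern = (r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
               r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)")
    match = re.search(pattern, summary)
    if match is None:
        raise ValueError("unrecognized BUSCO summary line")
    values = {key: float(val) for key, val in match.groupdict().items()}
    values["n"] = int(values["n"])  # n is the number of BUSCO groups searched
    return values


def is_consistent(values, tol=0.11):
    """Check C = S + D and C + F + M = 100 within a rounding tolerance."""
    return (abs(values["C"] - (values["S"] + values["D"])) <= tol
            and abs(values["C"] + values["F"] + values["M"] - 100.0) <= tol)


# Summary line reconstructed from the percentages reported in the text.
line = "C:92.9%[S:92.2%,D:0.7%],F:1.5%,M:5.6%,n:9226"
busco = parse_busco(line)
print(busco, is_consistent(busco))
```

The tolerance allows for the one-decimal rounding that BUSCO applies to each percentage.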