Background & Summary

Cervidae is the second largest family in Ruminantia (after Bovidae) and consists of 56 species1. Along with the common distinct attributes of ruminants (i.e. even-toed hooves, a multi-chambered stomach and headgear), males in Cervidae grow deciduous antlers (exceptions are the antlerless Chinese water deer and reindeer, in which both sexes grow antlers)2. Deer are excellent models for studying evolution, biodiversity, interspecies hybridization3,4, social organization (e.g. hierarchical status)5, unique organ development (e.g. fully regenerable antlers)6 and habitat selection (extreme cold vs extreme heat)7,8.

Red deer (Cervus elaphus) is the most studied species in Cervidae and comprises 22 extant subspecies9. Of these subspecies, eight are found in China, and three of these Chinese subspecies inhabit Xinjiang in northwest China: Tianshan red deer (C. e. songaricus Severzov, 1872), Altai red deer (C. e. sibiricus Severzov, 1873) and Tarim red deer (C. e. yarkandensis Blanford, 1892)10,11. Tarim red deer (Fig. 1a) is the only subspecies of red deer resident in Central Asia, a proposed site of origin for the genus Cervus12. This subspecies tolerates the extremely dry (mean annual evaporation is 45.8 times greater than the precipitation, and mean rainfall is 18.6 mm/year) and hot (average summer temperature is 32.7 °C) desert environment of the Tarim Basin in southern Xinjiang, China (Fig. 1b)10. Although little is known about the biology of this subspecies, it is likely to have evolved mechanisms to adapt to this hostile habitat. Recently, Tarim red deer has been classified as an endangered species by the IUCN and has been included in the China Red Data Book of Endangered Animals, as the population in its native habitat has been declining10.

Fig. 1

Photograph and location of the Tarim red deer selected in this study. (a) A photograph of an adult male Tarim red deer individual, from which blood samples were collected for genome sequencing. (b) A natural distribution map of Tarim red deer (yellow arrowhead).

Whole genome sequencing has become an increasingly popular technology for exploring taxonomy, evolution, biological phenomena and distinct attributes of organisms at the genomic level, as opposed to morphological, histological and other means13,14. Chen et al.15,16,17,18 recently published a paper in the journal Science in which 44 ruminant genomes were sequenced, including those of 6 deer species15. To date, 13 draft deer genomes have been reported, covering four deer subfamilies: Cervinae (4)15,16,17,18,19, Muntiacinae (3)15,20, Hydropotinae (1)15 and Odocoileinae (5)21,22,23,24,25,26. However, the genomes of most deer species (43) have yet to be sequenced, including some of the more economically important species, such as sika deer and red deer (sources of velvet antler, a precious Chinese medicine). Consequently, the evolution of the distinctive features of these deer species has not been resolved at the genetic level, for example, the adaptation of Tarim red deer to its extremely dry and hot environment. In addition, the quality of the published deer genomes is still not comparable to that of some other ruminants, such as cattle14. Therefore, whether these deer genomes can serve as reference genomes for relevant future studies is questionable.

This paper reports a high-quality Tarim red deer genome assembly, termed here CEY_v1, which was generated by combining sequences produced in the present study on the 10X Genomics GemCode platform with previously published genetic linkage map data27,28. The final CEY_v1 assembly was 2.60 Gb and consisted of 19,010 scaffolds (scaffolds ≥ 1 Kb) with 2.21% missing bases; the contig N50 and scaffold N50 were 275.5 Kb and 31.7 Mb, respectively. A total of 269 scaffolds, accounting for 96% of CEY_v1, were anchored onto 34 chromosomes. Almost 100% of the predicted genes (20,652) were annotated using biological databases. We believe that this high-quality reference genome will provide a valuable resource for future studies of Tarim red deer in particular, and of Cervidae and even Ruminantia in general, and will help shed light on the molecular mechanisms of animal adaptation to extremely hostile environments.

Methods

Ethics statement

Blood sampling carried out in this study was approved by the Animal Ethics Committee of the Institute of Special Wild Economic Animals and Plants, Chinese Academy of Agricultural Sciences (CAAS2017-06).

Genomic DNA extraction

A 4-year-old semi-domesticated male Tarim red deer (Fig. 1a) from the Korla region (Xinjiang Autonomous Region, China) was selected for blood sampling (via the jugular vein using EDTA vacuum tubes). The blood sample was stored at −80 °C until DNA extraction. Genomic DNA was extracted and purified using the QIAamp Blood DNA midi kit (Qiagen, Valencia, CA, USA).

Construction of 10x Genomics library

The genomic DNA concentration was measured using a Qubit® 2.0 Fluorometer (Life Technologies), and DNA quality was assessed using 1% gel electrophoresis to determine suitability for 10x Chromium library construction (10x Genomics, San Francisco, USA). After passing quality assessment, genomic DNA (1.2 ng in total) was used for library construction according to the manufacturer’s instructions, without size selection. The barcoded sequencing libraries were quantified using qPCR (KAPA Biosystems Library Quantification Kit for Illumina platforms). Finally, sequencing was conducted with 2 × 150 bp paired-end reads in two lanes on the Illumina HiSeq 4000 platform at BGI (China).

Genome sequencing and de novo assembly

In total, 195 Gb of sequencing data were generated from the Illumina paired-end sequencing. After low-quality reads were removed using NGS QC Toolkit29 with default parameters, 183.5 Gb of clean bases were obtained for de novo assembly using the Supernova (v2.0.1, 10x Genomics) assembler. The estimated genome size was 2.86 Gb with 63-fold raw and 43-fold effective coverage. The final size of our assembled draft genome was 2.60 Gb, comprising 19,010 scaffolds (scaffolds ≥ 1 Kb) with 2.21% missing bases; the contig N50 and scaffold N50 were 275.5 Kb and 31.7 Mb, respectively.
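For readers unfamiliar with the N50 statistic quoted above, the following minimal Python sketch shows one common way to compute it from a list of sequence lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy scaffold lengths (arbitrary units), not real data.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_scaffolds)
print(toy_n50)
```

The same function applies to contigs or scaffolds; only the input list of lengths changes.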

Anchorage of the genome assembly onto chromosomes

We further anchored these scaffolds onto chromosomes using ALLMAPS (v0.8.4)30 based on the published high-quality red deer genetic linkage map27,28. This published map consists of 34 sex-averaged linkage groups, corresponding to the haploid chromosome number of red deer, and includes a total of 38,083 SNP markers with a combined length of 2,740 cM. The locations of the SNPs were obtained by mapping the probe sequences (150 bp on both ends) of these SNP markers to our assembled sequences using BWA (v0.7.17)31. Probes with multiple alignments were removed. In the end, we successfully placed 38,042 (99.89%) uniquely mapped SNPs onto 34 chromosomes (Fig. 2). The locations of the SNPs in our assembly were retained for downstream analysis. To take advantage of the public availability of the female and male genetic maps, the two maps were assigned equal weight and merged. Overall, we anchored 269 scaffolds onto 34 chromosomes, representing 95.9% of the total assembled genome. Of these scaffolds, 160 had more than two markers and were oriented, representing 94.2% of CEY_v1 (Fig. 2 and Table 1). In CEY_v1, three small autosomes (i.e. chr 3, 8 and 31) contained only one large scaffold, whereas sex chromosome X had the highest number of scaffolds (Fig. 2). Given that the genetic linkage map is from a closely related subspecies, we arbitrarily set the size of gaps of unknown length to 100 bp.
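The probe-filtering step described above (discarding probes with multiple alignments and keeping the assembly position of each remaining SNP) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the tuple layout and the toy records are hypothetical, and the exact file layout passed to the anchoring software is not given in the text.

```python
from collections import Counter


def unique_marker_rows(hits, genetic_map):
    """Keep probes with exactly one alignment and pair them with map positions.

    hits: list of (probe_id, scaffold, position) tuples, one per alignment.
    genetic_map: dict mapping probe_id -> (linkage_group, centimorgans).
    Returns rows of (scaffold, position, linkage_group, centimorgans).
    """
    counts = Counter(probe_id for probe_id, _, _ in hits)
    rows = []
    for probe_id, scaffold, position in hits:
        # Skip multi-mapping probes and probes absent from the genetic map.
        if counts[probe_id] != 1 or probe_id not in genetic_map:
            continue
        linkage_group, cm = genetic_map[probe_id]
        rows.append((scaffold, position, linkage_group, cm))
    return rows


# Hypothetical toy alignments and map positions, not real data.
toy_hits = [
    ("snp1", "scaf_1", 1200),
    ("snp2", "scaf_1", 56000),
    ("snp3", "scaf_2", 900),
    ("snp3", "scaf_7", 4100),  # second alignment, so snp3 is dropped
]
toy_map = {"snp1": ("LG1", 0.0), "snp2": ("LG1", 1.5), "snp3": ("LG2", 3.2)}
toy_rows = unique_marker_rows(toy_hits, toy_map)
for row in toy_rows:
    print(",".join(str(field) for field in row))
```

Rows of this kind (scaffold, physical position, linkage group, genetic position) are the information an anchoring step needs to order and orient scaffolds along each linkage group.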

Fig. 2

Circos plot showing the 34 chromosomes of CEY_v1. (a) Chromosome length in Mb; (b) arrangement of the scaffolds (>1 Mb) in random colors within each chromosome; (c) heatmap of the number of mapped SNPs within 1 Mb windows, ranging from 0 to 60; (d) histogram showing the GC skew of 1 Mb windows with a 1 Kb step size; (e) line plot of gene density for 1 Mb windows; and (f) line plot of repeat density for 1 Mb windows.

Table 1 Statistics of chromosome anchoring based on the SNP markers.

Identifying Y chromosome scaffolds

Because of its repetitive nature, the Y chromosome is particularly challenging to assemble. Using previous Y chromosome assemblies from cattle14 and red deer19, we detected 37 scaffolds that are likely to be located on the Y chromosome using BLAST tools (E-value ≤ 1e−50). These encompass a total length of 5.15 Mb. Among the 33 genes structurally annotated on those scaffolds, four were identified as SRY, TSPY1, TSPY3 and ZFY. In humans, these four genes are linked to the Y chromosome, which supports the Y-chromosome location of the Tarim red deer scaffolds that carry them.

Annotation of repeat sequences

We annotated the repeat sequences in CEY_v1 using both de novo predictions and homology-based searching in the known repeat database. RepeatModeler (v1.0.11)32 and LTR_FINDER (v1.0.5)33 were used to construct the de novo repeat library. We used RepeatMasker (v3.3.0, http://www.repeatmasker.org/) with the RepBase (v17.01, http://www.girinst.org/repbase)34 transposable element (TE) library to identify known repeats in our genome. In addition, RepeatProteinMask in RepeatMasker (v3.3.0) was used to identify the TE proteins. Tandem Repeats Finder (TRF, v4.07)35 was used to identify the tandem repeats. The results showed that CEY_v1 contained a total of 1.09 Gb of non-redundant repetitive sequences, which accounted for 42.4% of the whole genome (Fig. 2 and Table 2). The main elements were LINEs, which accounted for 37.8% (980 Mb) of the whole genome (Table 3).
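Because several tools annotate the same regions, a non-redundant repeat total requires that overlapping annotations be counted once. The Python sketch below illustrates one simple way to do this by merging overlapping intervals on a single sequence. It is an assumption about how such a total can be derived, not a description of the authors' pipeline, and the interval values are hypothetical.

```python
def nonredundant_length(intervals):
    """Total bases covered by half-open (start, end) intervals on one sequence,
    counting overlapping or touching intervals only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical annotations from different tools on one toy scaffold.
toy_repeats = [(0, 100), (50, 150), (200, 260), (255, 300)]
toy_covered = nonredundant_length(toy_repeats)
toy_scaffold_length = 1000
print(toy_covered, f"{100 * toy_covered / toy_scaffold_length:.1f}%")
```

Summing the merged lengths over all scaffolds and dividing by the assembly size gives a genome-wide repeat percentage of the kind reported in Table 2.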

Table 2 Prediction of repeat elements in the Tarim red deer genome.
Table 3 Statistics of repeat elements in the Tarim red deer genome.

Gene prediction and functional annotation

After the repeat sequences were masked, de novo prediction was carried out with the Bos taurus training set based on default parameters using Augustus (v3.2.1)36. For homology prediction, protein sequences from six mammals (Bos taurus, Homo sapiens, Sus scrofa, Ovis aries, Equus caballus and Balaenoptera acutorostrata) retrieved from the NCBI database were aligned to CEY_v1 using tBLASTn (E-value ≤ 1e−5). GeneWise (v2.4.0)37 was then used to align against the matching proteins for accurate spliced alignments for the prediction of gene structure. Finally, GLEAN (v1.0.1)38 was used to combine homology with de novo gene models to form a comprehensive and non-redundant reference gene set with the following parameters: the minimum coding sequence length was 150 bp and maximum intron length was 10 Kb. We identified 20,652 protein-coding genes (Fig. 2 and Table 4) in our CEY_v1.
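The two thresholds quoted above (minimum coding sequence length of 150 bp and maximum intron length of 10 Kb) can be illustrated with a small Python filter. How the combining software applies these parameters internally is not described in the text, so the function below is only a sketch of the thresholds themselves; its name and the toy gene models are hypothetical.

```python
MIN_CDS_LENGTH = 150        # bp, threshold stated in the text
MAX_INTRON_LENGTH = 10_000  # bp, threshold stated in the text


def passes_thresholds(cds_exons):
    """Check one gene model given its coding exons as half-open (start, end)
    intervals on the same strand of one scaffold."""
    exons = sorted(cds_exons)
    cds_length = sum(end - start for start, end in exons)
    # Intron length is the gap between consecutive coding exons.
    introns = [nxt[0] - cur[1] for cur, nxt in zip(exons, exons[1:])]
    longest_intron = max(introns, default=0)
    return cds_length >= MIN_CDS_LENGTH and longest_intron <= MAX_INTRON_LENGTH


# Hypothetical toy gene models, not real annotations.
toy_models = {
    "gene_ok": [(0, 100), (500, 600)],
    "gene_short_cds": [(0, 60), (300, 360)],
    "gene_long_intron": [(0, 100), (20_000, 20_100)],
}
toy_kept = [name for name, exons in toy_models.items() if passes_thresholds(exons)]
print(toy_kept)
```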

Table 4 The statistics of gene models of protein-coding genes annotated in the Tarim red deer genome.

Functional annotation of the protein-coding genes was carried out using BLAST tools (E-value ≤ 1e−5) against the NCBI non-redundant protein (NR), TrEMBL, Gene Ontology (GO), SwissProt39 and Kyoto Encyclopedia of Genes and Genomes (KEGG)40 databases. Overall, 20,652 (100%) protein-coding genes were annotated in at least one public functional database (Table 5).

Table 5 Statistics of functional annotation.

Data Records

Illumina DNA sequencing data from 10x Genomics libraries (experiments under the SRA study accession SRP220754) were submitted to the NCBI Sequence Read Archive (SRA) database under BioProject accession number PRJNA564362 (ref. 41). The assembled genome42 was deposited at DDBJ/ENA/GenBank under the accession WMHW00000000. The version described in this paper is version WMHW00000000.1 (ref. 43). Chromosome Y sequences of CEY_v1 were deposited at figshare44. Gene structure annotation, repeat predictions and gene functional annotation files of CEY_v1 were deposited at figshare45.

Technical Validation

A comparison of the assembly metrics of the Tarim red deer scaffolds with those of other deer species (Table 6) shows that CEY_v1 represents a substantial improvement in both contig and scaffold lengths, indicating that our assembly is highly contiguous. The similarity of the assembled length and the low number of gaps provide evidence that CEY_v1 is a high-quality genome assembly, which can be used with confidence for relevant downstream analyses and investigations.

Table 6 Comparison of the deer genome assembly metrics.

To estimate the quality of the anchored chromosomes, we compared the physical and genetic maps. The reconstructed chromosomes showed few conflicting markers, and the female and male genetic maps exhibited perfect collinearity, except for chromosome X (i.e. chromosome 34) (Fig. 3a and Supplementary Fig. S1). Furthermore, two scatter plots, where dots represent the physical position (x-axis) versus the genetic map distance (y-axis), revealed no breaks, illustrating near-perfect collinearity (Fig. 3b and Supplementary Fig. S1). In addition, the sizes of the reconstructed chromosomes were highly consistent (R² = 0.987) with previous estimates27, also indicating the high quality of anchorage of scaffolds onto chromosomes (Fig. 3c).
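The collinearity check described above relies on the correlation between physical and genetic marker positions. As a minimal sketch, the Pearson correlation coefficient can be computed as shown below. The marker coordinates are hypothetical toy values and do not come from the CEY_v1 data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical markers on one chromosome: physical position (Mb) and
# genetic map position (cM).
toy_physical_mb = [1.0, 5.0, 12.0, 20.0, 33.0]
toy_genetic_cm = [0.0, 4.1, 9.8, 18.5, 30.2]
toy_rho = pearson(toy_physical_mb, toy_genetic_cm)
print(round(toy_rho, 3))
```

A value near 1 or −1 indicates that marker order in the assembly follows the genetic map; a chromosome assembled in reverse orientation would give a value near −1.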

Fig. 3

Reconstructed chromosome 1 of the Tarim red deer genome (CEY_v1) using two genetic maps: the red deer female and male genetic maps with equal weights. (a) Side-by-side alignments between the chromosome and the linkage groups. Conflicting markers are shown as crossing lines. (b) Two scatter plots, in which dots represent the physical position (x-axis) versus the genetic map distance (y-axis) on the chromosome, show a monotonic trend with no breaks, illustrating near-perfect collinearity. Adjacent scaffolds within the chromosome are shown as boxes with alternating shades, marking the boundaries of the component scaffolds. The ρ-value on each scatter plot is the Pearson correlation coefficient, which ranges from −1 to 1 (values closer to −1 or 1 indicate near-perfect collinearity). (c) Correlation between the sizes of the reconstructed chromosomes and the previous estimates by Johnston et al.27.

To assess the completeness of CEY_v1, we performed an analysis using Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0) with the mammalia_odb9 database46. Our analysis showed that 94.1% of the expected mammalian genes were complete (90.5% single-copy and 3.6% duplicated), 2.3% were fragmented and 3.6% were missing in CEY_v1.
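As a quick consistency check on the BUSCO figures above, the reported categories can be summed in a few lines of Python. The dictionary keys are illustrative names chosen here; the percentages are those stated in the text.

```python
# Percentages reported in the text for the mammalia_odb9 set.
busco_percent = {
    "complete_single": 90.5,
    "complete_duplicated": 3.6,
    "fragmented": 2.3,
    "missing": 3.6,
}

# Complete = single-copy + duplicated; all four categories should sum to 100%.
complete = round(
    busco_percent["complete_single"] + busco_percent["complete_duplicated"], 1
)
total = round(sum(busco_percent.values()), 1)
print(complete, total)
```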