The sequence and de novo assembly of hog deer genome

Hog deer (Axis porcinus) is a small deer species in family Cervidae and has been undergoing a serious and global decline during the past decades. Chengdu Zoo currently holds a captive population of hog deer with sufficient genetic diversity in China. We sequenced and de novo assembled its genome sequence in the present study. A total of six different insert-size libraries were sequenced and generated 395 Gb of clean data in total. With aid of the linked reads of 10X Genomics, genome sequence was assembled to 2.72 Gb in length (contig N50, 66.04 Kb; scaffold N50, 20.55 Mb), in which 94.5% of expected genes were detected. We comprehensively annotated 22,473 protein-coding genes, 37,019 tRNAs, and 1,058 Mb repeated sequences. The newly generated reference genome is expected to significantly contribute to comparative analysis of genome biology and evolution within family Cervidae.

Hog deer (Axis porcinus) is a small deer species in family Cervidae and has been undergoing a serious and global decline during the past decades. Chengdu Zoo currently holds a captive population of hog deer with sufficient genetic diversity in China. We sequenced and de novo assembled its genome sequence in the present study. A total of six different insert-size libraries were sequenced and generated 395 Gb of clean data in total. With aid of the linked reads of 10X Genomics, genome sequence was assembled to 2.72 Gb in length (contig N50, 66.04 Kb; scaffold N50, 20.55 Mb), in which 94.5% of expected genes were detected. We comprehensively annotated 22,473 protein-coding genes, 37,019 tRNAs, and 1,058 Mb repeated sequences. The newly generated reference genome is expected to significantly contribute to comparative analysis of genome biology and evolution within family Cervidae.

Background & Summary
There are 56 cervid species (family Cervidae) in the Red List of International Union for Conservation of Nature 1 and form the second most diverse group among terrestrial artiodactyls 2 . Cervids are widely geographical distribution and show considerable variation on antler phenotype, body size and other morphologic features 3 . Therefore, they are the ideal materials for studying evolutionary dynamics of phenotypes and genetic adaptions to highly diverse environments 4 . With the development of highthroughput sequencing technologies 5 , genome sequences could be obtained in a more economical way and would largely facilitate biological researches in cervids. Although the draft genomes have been recently published for red deer (Cervus elaphus) 6 and reindeer (Rangifer tarandus) 7 , a large number of cervid species remain to be sequenced. Hog deer (Axis porcinus) is a small deer (30-50 kg adult weight) in Cervinae subfamily ( Fig. 1) and mainly distributed in Pakistan, Nepal, India, Bangladesh, Burma, China, Thailand and Laos 8 . A specific feature of hog deer is that it has a narrow habitat in wet or moist tall grasslands. Recently, the wild hog deer has been recognized to globally decrease in population size and even to be almost completely eliminated in China 9,10 . Chengdu Zoo of Sichuan holds the largest captive population of hog deer in China, for which the genetic diversity has been successfully revealed by the genome-wide SNPs in our lab 11 . In the present study, we further sequenced and de novo assembled the genome of hog deer, which is expected to contribute to the comparative analysis of genome biology among cervid species.

Ethics statement
In the present study, blood sample was collected by veterinarian at annual health inspection and tissue samples for RNA extraction were obtained from the accidentally died individuals with fighting injury. The study design and all experimental methods were approved by Animal Care and Use Committee in Chengdu Zoo.

Sample collection and construction of sequencing libraries
The blood was sampled from a healthy female hog deer at two years old. Genomic DNA was isolated using Qiagen DNA purification kit (Qiagen, Valencia, CA, USA). A total of six paired-end and matedpair sequencing libraries with 250 bp, 350 bp, 450 bp, 2 Kb, 5 Kb and 10 Kb of insert sizes were constructed according to Illumina's protocol (Illumina, San Diego, CA, USA). For insert sizes of 250 bp to 450 bp, 0.5 μg of genomic DNA was fragmented, end-paired, and ligated to adaptors, respectively. The ligated fragments were fractionated on agarose gels and purified by PCR amplification to produce sequencing libraries. For the mated-pair libraries with insert sizes of 2 Kb to 10 Kb, 120 μg of genomic DNA was circularized and digested. Furthermore, a 10X Genomics linked-read library was also constructed successfully according to protocol (10X Genomics, San Francisco, USA).
Six tissues including brain, heart, lung, liver, spleen and kidney were sampled for three hog deer. Subsequently, all 18 samples were subjected to RNA extraction using RNAiso Pure RNA Isolation Kit (TaKaRa, Japan), which was followed by DNaseI treatment. NanoVue Plus spectrophotometer (GE Healthcare, NJ, USA) was used to assess concentration and quality of the extracted RNAs. All RNA samples were sequenced by Illumina HiSeq X for generating paired-end reads in 150 bp which three same samples were pooled. All sequencing libraries constructed were detailed in Table 1.

Sequencing and genome assembly
A total of 404 Gb sequencing data were generated from the Illumina's paired-end sequencing. Read quality was analyzed using NGS QC Toolkit 12 and the low-quality reads were discarded according to any one of the three criterions, including (1) reads containing adaptor sequences, (2) reads containing ambiguous bases more than 10% of total length, and (3) reads containing low-quality bases (Q-value o 5) more than 20% of total length. If any member of the paired reads was classified as low quality, both pairs were discarded. After filtering, 395.2 Gb clean bases were obtained for de novo assembly of genome. Also, 230.78 Gb clean bases, out of 236.7 Gb sequencing data, were obtained from 10X Genomics sequencing (Table 1).
SOAPdenovo2 13 was employed for constructing contigs and scaffolds with the optimized parameters of -K 41 and -d 1 for the PREGRAPH step, -k 41 for MAP step, and -l 43 for SCAFF step, respectively. Briefly, contigs were first de novo assembled with short reads, against which all reads were aligned for constructing scaffolds with aid of the paired information of reads. Second, gaps were filled according to the paired information of reads. Third, these initially obtained scaffolds were further improved by incorporating the linked reads of 10X Genomics using Fragscaff 14 with the parameters of -fs1 '-m 3000 -q 30' -fs2 '-C 2' -fs3 '-j 1.25 -u 2'. These processes finally yielded a draft genome of hog deer with a total length of 2.72 Gb, contig N50 of 66.04 Kb and scaffold N50 of 20.55 Mb ( Table 2).
The completeness of genome assembly was assessed by three approaches as followed. The single copy orthologs set (BUSCO, version 2.0) were searched against the assembled genome of hog deer using BUSCO tool 15 , which revealed that 94.5% of the 843 expected genes are present in this assembly. Based on a core gene set involved in 248 evolutionarily conserved genes from six eukaryotic model organisms, the comparative analysis by CEGMA tool 16 similarly revealed that 95.97% of these core genes have been successfully assembled. Finally, the Core Vertebrate Genes (CVG) 17 was used as reference gene set to assess the completeness by gVolante tool (https://gvolante.riken.jp), which also showed that this assembly completely captured 216 core genes(92.70%).

Annotation of gene structure
We employed three approaches for predicting the protein-coding genes within hog deer genome, including homologous comparison, ab initio prediction and RNA-seq based annotation. For homologous comparison, the reference protein sequences from Ensembl database (release 91) for five species of human (Homo sapiens), cattle (Bos taurus), water buffalo (Bubalus bubalus), sheep (Ovis aries) and bactrian camel (Camelus bactrianus) were aligned against hog deer genome using TBLASTN search with parameters of e-value 1e-5 in the "-F F" option 24 . After filtering low-quality records, all blast hits were concatenated. Sequence of each candidate gene was further extended upstream and downstream by 1,000 bp to represent the whole region of this gene, within which the gene structure was predicted using GeneWise tool 25 . RNA reads from six tissues were de novo assembled with Trinity 26 (--normalize_reads, --full_cleanup, --min_glue 2, --min_kmer_cov 2, --KMER_SIZE 25) and the assembled sequences were aligned against hog deer genome using Program to Assemble Spliced Alignment (PASA), by which the effective alignments were assembled to gene structures 27 . We simultaneously employed five tools of Augustus 28 , GeneID 29 , GeneScan 30 , GlimmerHMM 31 and SNAP 32 for ab initio prediction, in which the parameters were computationally optimized by training a set of high-quality proteins that have been derived from the PASA gene models with default parameters. Simultaneously, RNA-seq reads were aligned to hog deer genome using TopHat with default parameters 33 , by which the mapped reads were assembled into gene models by Cufflinks 34 . According to these three approaches, the non-redundant reference gene set was finally generated using EvidenceModeler (EVM) tool 27 . In order to get the UTRs and alternative splicing variation information, we used PASA2 to update the gene models 27 . Finally, we successfully generated reference gene structures within hog deer genome, which is composed of 22,473 protein-coding genes (Table 4). We also predicted gene structures of tRNAs, rRNAs and other non-coding RNAs (Table 5). A total of 37,019 tRNAs were predicted using t-RNAscan-SE tool (--evalue 1e-10) 35 . Because rRNA genes are highly evolutionarily conserved, we choose human rRNA sequence as references and then predicted 920   rRNA genes using Blast tool with default parameters 36 . Small nuclear and nucleolar RNAs were annotated using the infernal tool 37 .

Functional annotation of protein-coding genes
We functionally annotated the predicted proteins within hog deer genome according to homologous searches against three databases of SwissProt 38 , InterPro 39 and KEGG pathway 40 . Of that, InterproScan tool 41 in coordination with InterPro database 39 were applied to predict protein function based on the conserved protein domains and functional sites. KEGG pathway and SwissProt database were mainly mapped by the constructed gene set to identify best match for each gene. Overall, 89.7%, 87.4%, 79.1% genes show positive hits in SwissProt, InterPro, and KEGG, respectively. In summary, a total of 20,994 genes (93.4%) were successfully annotated by function implications or the conserved functional motifs ( Table 6).

Code availability
The following bioinformatic tools and versions were used for generating all results as described in the main text:

Data Records
A total of 12 sequencing runs of DNA-seq (SRR7410909-17, SRR7410919-21) and six runs of RNA-seq (SRX4282445-49, SRX4282453) were obtained and deposited to NCBI Sequence Read Archive (SRA) (Data Citation 1). The assembled draft genome has been deposited at GenBank (Data Citation 2). The annotation results of repeated sequences, gene structure and functional prediction were deposited in Figshare database (Data Citation 3).

RNA integrity
In prior to constructing RNA-seq libraries, the concentration and quality of total RNA were evaluated using Agilent 2100 Bioanalyser (Agilent, Santa Clara, USA). Three metrics, including total amount, RNA integrity and rRNA ratio, were used to estimate the content, quality and degradation level of RNA samples. In this study, only total RNAs with a total amount ≥ 10 μg, RNA integrity number ≥ 8, and rRNA ratio ≥ 1.5 were finally subjected to construct the sequencing library.
Quality filtering of raw reads The initially generated raw sequencing reads were evaluated in terms of the average quality score at each position, GC content distribution, quality distribution, base composition, and other metrics. Furthermore, the sequencing reads with low quality were also filtered out before the genome assembly and annotation of gene structure.