The sequence and de novo assembly of the wild yak genome

Vulnerable populations of wild yak (Bos mutus), the wild ancestral species of domestic yak, survive in extremely cold, harsh and oxygen-poor regions of the Qinghai-Tibetan Plateau (QTP) and adjacent high-altitude regions. In this study, we sequenced and assembled its genome de novo. In total, six different insert-size libraries were sequenced, and 662 Gb of clean data were generated. The assembled wild yak genome is 2.83 Gb in length, with a contig N50 size of 63.2 kb and a scaffold N50 size of 16.3 Mb. BUSCO assessment indicated that 93.8% of the highly conserved mammalian genes were completely present in the genome assembly. Annotation of the wild yak genome assembly identified 1.41 Gb (49.65%) of repetitive sequences and a total of 22,910 protein-coding genes, including 20,660 (90.18%) annotated with functional terms. This first construction of the wild yak genome provides a valuable genetic resource that will facilitate further study of the genetic diversity of bovine species and accelerate yak breeding efforts.

www.nature.com/scientificdata
were approved by the Ethical Committees of Lanzhou University. All libraries were sequenced on an Illumina HiSeq 2000 platform with 150 bp read length, following the manufacturer's instructions. Finally, 760.85 Gb of raw data were generated in total (Table 1).
Preprocessing and genome size estimation. All the sequencing reads were preprocessed for quality control and filtered with stringent criteria using Lighter v1.1.1 10 software. First, raw data were filtered by removing reads with >10% unknown bases. Then, paired reads in which low-quality bases (quality scores ≤7) covered more than 65% of the read length were filtered out. Reads representing PCR duplicates or containing adapter contamination were also removed. Finally, both reads of a pair were removed if read 1 and read 2 overlapped by >10 bp, allowing 10% mismatch. In total, 662.3 Gb of clean reads were obtained after filtering (Table 1).
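The first two filtering criteria above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, the representation of a read as a sequence string plus a list of Phred scores, and the rule that a pair is dropped when either mate fails are all assumptions made for illustration.

```python
def passes_base_filters(seq, quals, max_n_frac=0.10, low_q=7, max_low_q_frac=0.65):
    """Return True if a single read survives the two base-level filters.

    seq   : read sequence as a string (unknown bases written as 'N')
    quals : list of integer Phred quality scores, one per base
    The thresholds mirror the text: >10% unknown bases, or low-quality
    bases (score <= 7) covering more than 65% of the read, cause removal.
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = sum(base in "Nn" for base in seq) / len(seq)
    if n_frac > max_n_frac:
        return False
    low_frac = sum(q <= low_q for q in quals) / len(quals)
    return low_frac <= max_low_q_frac


def keep_pair(read1, read2):
    """Keep a read pair only if both mates pass (an assumed pairing rule).

    Each read is a (seq, quals) tuple.
    """
    return passes_base_filters(*read1) and passes_base_filters(*read2)
```

For example, `keep_pair(("ACGT" * 25, [30] * 100), ("N" * 20 + "A" * 80, [30] * 100))` rejects the pair because the second mate contains 20% unknown bases.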
Table 4. Summary statistics of interspersed repeats in the assembled wild yak genome.

Prior to genome assembly, all the preprocessed sequences from the short-insert library were subjected to genome size estimation using Genome Characteristics Estimation (GCE) with a k value of 21. The genome size of the wild yak was estimated to be around 3.09 Gb, using the following formula: genome size = k-mer number/k-mer depth, where the k-mer number refers to the total number of k-mers, and k-mer depth is the depth of the main peak in the k-mer frequency distribution (Fig. 1).
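The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a reimplementation of GCE: the histogram format (depth mapped to the number of distinct k-mers at that depth) and the `min_depth` cutoff used to skip the low-depth error peak are assumptions.

```python
def estimate_genome_size(kmer_histogram, min_depth=5):
    """Estimate genome size as (total k-mer number) / (main peak depth).

    kmer_histogram : dict mapping depth -> number of distinct k-mers
                     observed at that depth
    min_depth      : hypothetical cutoff so that the error peak at very
                     low depth is not mistaken for the main peak
    """
    # Total number of k-mers = sum over depths of depth * distinct k-mers.
    total_kmers = sum(depth * count for depth, count in kmer_histogram.items())
    candidate_depths = [d for d in kmer_histogram if d >= min_depth]
    if not candidate_depths:
        raise ValueError("no depths at or above min_depth")
    # Main peak = depth with the most distinct k-mers among the candidates.
    peak_depth = max(candidate_depths, key=lambda d: kmer_histogram[d])
    return total_kmers / peak_depth
```

With a toy histogram such as `{19: 100, 20: 1000, 21: 100}`, the total k-mer number is 24,000, the main peak lies at depth 20, and the estimate is 1,200 bases.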
Genome assembly. For de novo genome assembly, Platanus software was used for constructing contigs and scaffolds with default parameters, and GapCloser was employed to fill the remaining gaps in the scaffolds with all sequencing reads. These steps finally yielded a wild yak draft genome with a total length of 2.83 Gb, accounting for 91.5% of the estimated genome size (contig and scaffold N50 sizes: 63.2 kb and 16.3 Mb, respectively) (Table 2).
To evaluate the completeness of our assembly, we carried out BUSCO 11 analyses, and the results indicated that 3,974 of the 4,104 conserved single-copy genes in mammals were present in our assembly, of which 3,799 were single-copy, 55 were duplicated and 120 were fragmented matches (Table 3). To validate the single-base accuracy of the genome assembly, we aligned the high-quality reads of short-insert libraries to the assembly using Burrows-Wheeler Aligner (BWA, v0.7.15-r1140) 12 software, and the alignment outputs were converted to Binary Alignment Map (BAM) format via SAMtools v1.3 13. The genome coverage was then calculated by a custom Perl script, which indicated that more than 93.9% of the assembly had >20-fold coverage.
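For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and input format are illustrative and are not taken from the study's scripts.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")
```

For instance, for lengths `[2, 2, 2, 3, 3, 4, 8, 8]` the total is 32, and the two longest sequences (8 + 8) already reach half of it, so the N50 is 8.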
Repeat annotation. Repetitive regions of the wild yak genome were identified using a combination of de novo and homology-based approaches, as applied in a previous analysis of the Ovis ammon polii genome 14. For the de novo prediction, RepeatModeler v1.0.11 was employed first to construct a de novo repeat library, then RepeatMasker v4.0.7 15 was used to identify repeats using both the RepBase 16 library of known transposable elements (TEs) and the self-trained repeat database. Next, we applied RepeatProteinMask (a package in RepeatMasker) to identify repeats at the protein level using the TE protein database. In addition, tandem repeats were annotated using Tandem Repeat Finder (TRF, v4.0.9) 17. Finally, the repeats from all methods were combined into a non-redundant set according to their coordinates in the genome. Overall, we identified 1.41 Gb of non-redundant repetitive sequences, representing 49.65% of the wild yak genome assembly; long interspersed nuclear elements (LINEs) were the most abundant, accounting for 35.98% of the whole genome (Fig. 2; Table 4).
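One common way to obtain a non-redundant repeat length from several overlapping annotation sets is to merge intervals by coordinate. The Python sketch below illustrates that idea under stated assumptions: intervals are given as (scaffold, start, end) tuples with 0-based half-open coordinates, and the function name is hypothetical. It is not the procedure documented by the authors.

```python
def nonredundant_length(intervals):
    """Total number of bases covered by at least one interval.

    intervals : iterable of (scaffold, start, end) tuples, 0-based and
                half-open, possibly overlapping and from different tools
    """
    total = 0
    current = None  # (scaffold, start, end) of the interval being extended
    for scaffold, start, end in sorted(intervals):
        if current and current[0] == scaffold and start <= current[2]:
            # Overlapping or adjacent on the same scaffold: extend.
            current = (scaffold, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (scaffold, start, end)
    if current:
        total += current[2] - current[1]
    return total
```

Dividing the result by the assembly length gives the repeat fraction; for example, 210 covered bases in a 1,000 bp toy assembly correspond to 21%.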
Gene prediction and annotation. We employed a combination of homology-based and de novo prediction methods to identify protein-coding genes. For homology-based prediction, protein sequences of seven species (Bos taurus, Equus caballus, Homo sapiens, Ovis aries, Sus scrofa, Bison bonasus, Bos grunniens) downloaded from Ensembl 18 and GigaDB 19,20 were aligned to the wild yak genome using TBLASTN 21. Then GeneWise v2.4.1 22 software was applied to search for accurately spliced alignments based on the filtered homologous genome sequences. For de novo prediction, we used Augustus 23, Geneid 24, GeneMark, GlimmerHMM 25 and SNAP 26 to predict genes with parameters trained on wild yak and human repeat-masked genomes. EVidenceModeler software (EVM, v1.1.1) 27 was employed to generate a consensus gene set by integrating the genes predicted by the homology and de novo approaches. Low-quality genes of short length (proteins with fewer than 30 amino acids) and/or exhibiting premature termination were removed to produce the final gene set, which is composed of 22,910 genes (Fig. 3; Table 5).
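The final filtering step can be sketched as follows. This is a minimal illustration, assuming that each gene model is represented by its translated protein sequence, that `*` marks a stop codon, and that "premature termination" means a stop codon anywhere before the end of the sequence; the function name is hypothetical.

```python
def is_high_quality(protein, min_length=30):
    """Return True if a predicted protein passes the two filters.

    protein    : translated amino-acid sequence; a trailing '*' (terminal
                 stop codon) is allowed and not counted in the length
    min_length : minimum number of amino acids (30 in the text)
    """
    body = protein[:-1] if protein.endswith("*") else protein
    if len(body) < min_length:
        return False  # too short
    if "*" in body:
        return False  # internal stop codon, i.e. premature termination
    return True
```

A gene set held as a dictionary of identifier to protein sequence could then be filtered with `{gid: p for gid, p in proteins.items() if is_high_quality(p)}`.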
Putative biological functions of these predicted high-quality genes were assigned by searching against five publicly available databases: TrEMBL, Swiss-Prot 28, InterPro 29, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) 30. Approximately 90.18% of these genes were functionally annotated with at least one of these databases, with 90.05, 88.15, 83.51, 64.20 and 53.01% scoring positive hits in TrEMBL, Swiss-Prot, InterPro, GO and KEGG, respectively (Table 6).

Data Records
The whole genome sequencing data were submitted to the NCBI Sequence Read Archive (SRA) database with accession number SRP194583 and BioProject accession PRJNA531398 31. The assembled draft genome of wild yak has been deposited at GenBank under accession number VBQZ00000000 32. The annotation results of repeated sequences, gene structure and functional prediction were deposited in the Figshare database 33.

Technical Validation
Quality assessment of the genome assembly. The assembly presented here is the first wild yak genome version. The contig N50 and scaffold N50 sizes were 63.2 kb and 16.3 Mb, respectively, and the longest scaffold was 75,900,441 bp. There are 258 scaffolds more than 1 Mb long, with a total length of 2,486,540,864 bp, representing 87.83% of the wild yak genome. By aligning the reads of short-insert libraries to the wild yak assembly, we found that more than 93.9% of the genome had >20-fold coverage, indicating high accuracy at the nucleotide level. BUSCO analysis carried out to assess the completeness of our assembly resulted in a BUSCO score of 96.8% (complete = 93.8%, single = 92.4%, duplicated = 1.4%, fragmented = 3.0%, missing = 3.2%, genes = 4,104). These results are comparable with those for the published European bison (wisent) 34 and domestic yak 4 genomes, suggesting that our assembly is of high quality and largely complete.
Gene prediction and annotation validation. Gene models in the wild yak assembly were predicted using a combination of homology-based and ab initio gene prediction approaches. Then EVM software was employed to integrate the gene prediction results to produce a consensus gene set. To enhance the quality of the gene prediction, we removed low-quality genes of short length (proteins with fewer than 30 amino acids) and/or exhibiting premature termination. The final gene set consisted of 22,910 genes, and the distributions of gene length, CDS length, exon length, intron length and exon number were similar to those of other mammals (Fig. 3). BUSCO analysis was also performed to assess the completeness of these predicted genes, resulting in a BUSCO value of 97.7% (complete = 94.8%, single = 93.1%, duplicated = 1.7%, fragmented = 2.9%, missing = 2.3%, genes = 4,104) (Table 6). In addition, functional annotation of these predicted genes indicated that 90.18% of them could be assigned to at least one functional term (Table 5). These results indicate that the annotated gene set of the wild yak genome is largely complete.

Code availability
The software versions, settings and parameters used are described below.