Background & Summary

The emergence and rapid development of new sequencing technologies, such as single-molecule real-time sequencing (SMRT) by Pacific Biosciences (PacBio)1, nanopore sequencing by Oxford Nanopore Technologies2 10x genomics3, optical and chromosome-contact maps from BioNano Genomics4, and Hi-C sequencing5 have facilitated the construction of high-quality genome assemblies6. Integrative application of these techniques elevate human genome assembly to a Telomere-to-Telomere (T2T) gapless level. Subsequent mining of the human T2T genome unveils genetic variations of mitoplasts, centromeres and other previously unassembled regions, which significantly enriched the understanding of human genome diversity, evolution and disease occurrence7. Other studies on important livestock also showed the important values of high quality genome assemblies. Qiu et al. presented the draft genome sequence of a female domestic yak generated using Illumina-based technology and related to sensory perception and energy metabolism, as well as an enrichment of protein domains involved in sensing the extracellular environment and hypoxic stress8. Zhang et al. used long-read resequencing data for 6 wild and 23 domestic yaks and identified genes that are predominantly related to the nervous system, behaviour, immunity, and reproduction and may have been targeted by artificial selection during yak domestication9. Liu et al. constructed a graph-genome for 47 genomes of 7 cross-fertile bovine species and validated a significant association of the selected stratified SVs with gene expression10. These aforementioned research greatly deepened the understanding of the genetic basis of high-altitude adaptation of bovine species. Decoding of high-quality genome assemblies is increasingly becoming a powerful approach serving clarification of the genetic mechanisms of adaptive evolution.

L. oiostolus is a species endemic to the Qinghai-Tibet Plateau (QTP) and is also commonly known as the gray-tailed hare. It belongs to the genus Lepus11. It is distributed at an elevation of 2,100–4,000 m in the alpine zone of the Qinghai-Tibet Plateau in China, making it the highest-altitude rabbit species in the world. In contrast, the typical habitat of rabbits rarely exceeds 600 m at altitude12. For instance, Oryctolagus cuniculus, a closely related species of L. oiostolus, has been used as a model organism for biomedical research. Previous physiological studies demonstrated that L. oiostolus has developed unique tolerance to plateau environment, making this species an excellent model for investigation of genetic mechanisms contributing to high-altitude adaptation13. Although deciphering of genome assemblies of plateau animals have significantly broaden our knowledge horizon concerning plateau adaptation and evolution, the genome information of L. oiostolus is still lacking8,14,15.

In this study, we assembled a chromosome-scale genome of L. oiostolus by combining Illumina and PacBio data with Hi-C technology. The final genome size was 2.85 Gb with N50 sizes of 64.25 Mb and a complete Benchmarking Universal Single-Copy Orthologs (BUSCO) score of 96.2%. A total of 2.80 Gb (98.1%) genome sequences were further clustered and ordered into 24 chromosomes. Genome annotation predicted 22,295 protein-coding genes. The availability of a complete and detailed genome assembly is essential to basic biological research. This paper provides a valuable genomic resource for research into the molecular mechanisms and evolution of L. oiostolus.

Methods

Sample collection and sequencing

L. oiostolus samples for genome sequencing were collected from Haiyan County, Haibei Prefecture, Qinghai Province, China (100°98’E, 36°90’N). Genomic DNA was extracted from muscle tissue and blood using the TIANamp Genomic DNA kit. Agarose gel electrophoresis (0.7% agarose gel) was used to assess overall DNA quality. DNA purity was assessed using a NanoDrop One Spectrophotometer (Thermo Fisher Scientific), and DNA concentration was determined using a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).

We implemented a hybrid strategy combining Illumina short-read sequencing, PacBio long-read sequencing, and Hi-C sequencing technologies to obtain better sequencing data. For Illumina short reads, a genomic library with insert sizes of 150 bp was constructed using the NextEra DNA Flex X Library Prep Kit (Illumina, San Diego, CA, USA). Qubit 2.0, Agilent 2100, and qPCR were used to ensure the quality of the library, and then the library was subsequently sequenced on the Illumina NovaSeq. 6000 platform (Illumina, San Diego, CA, USA). For PacBio long-read sequencing, PCR-free SMRT bell libraries were constructed and sequenced on the PacBio Sequel II sequencing platform. To generate a chromosomal-level assembly of the L. oiostolus genome, a Hi-C library was generated by the DpnII restriction enzyme following in situ ligation protocol14. The formaldehyde cross-linked DNAs were digested with DpnII and treated with biotin, which were ligated to the ends of the fragmented DNA sequences. Cyclization of DNA after end-repair was used to identify the location of intersecting DNA. After reversing the crosslinking, ligated DNA fragments were fragmented into 300–700 bp sizes, followed by a biotin-streptavidin purification for library construction. Finally, the Hi-C libraries were quantified and sequenced with PE150 sequencing read lengths on the Illumina platform. Finally, 79.03 Gb of PacBio reads (Table S1) and 177.15 Gb of filtered Illumina short-read sequencing data (Table S2) were obtained from the L. oiostolus genome.

Genome survey and assembly

Using the kmer-freq subroutine in the software GCE v1.0.0, the genome size, heterozygosity, and duplication ratio were estimated based on the k-mer distribution of 19-mers extracted from Illumina short reads. A total of 223.91 Gb raw PacBio subreads were filtered and corrected with the CCS pipeline v6.0.0 (parameters: -min-passes 3 –min-snr 2.5 –top-passes 60)15. The resulting CCS reads were subjected to hifiasm v0.14.216 for de novo assembly. We corrected the primary contigs with the pilon v1.2317 using 89.12 Gb (70.35×) of Illumina paired-end reads. Using the samtools v1.918 and bwa v0.7.1714, after which the processed reads from the Illumina sequencing platform were mapped to the reference genome. The mapping rate and coverage were counted to confirm the consistency and integrity of the assembled genome. The genome size is 2.83 Gb with 0.72% heterozygosity and 36.55% duplicate repeatability, as estimated by K-mer-based methods (Table S3; Figure S1). The resulting assembly was 2.86 Gb with a contig N50 of 64.25 Mb and the longest contig being 147.07 Mb in length (Table 1; Fig. 1). 99.99% of Illumina short reads were successfully aligned to our assembly (Table S4).

Table 1 Features of the assembled genome.
Fig. 1
figure 1

Distribution of various elements on the chromosomes of L. oiostolus. From the outer circle to the inner circle: (A) Chromosomes karyotype. (B) Gene density. (C) Distribution of GC content in the genome. (D) Repetitive elements density. (E) Schematic presentation of major inter-chromosomal relationships in the L. oiostolus genome.

Hi-C analysis was used to generate the contig assembly at the chromosome level. The raw data were filtered using a perl script as implemented in the software LACHESIS v1.5715. To obtain uniquely mapped read pairs, the cleaned data were aligned to the initial genome assembly using BWA-MEM. Approximately 1.18 Gb of clean data were generated (Table S5). The hicup v0.8.016 was used to evaluate the validity of the Hi-C data based on uniquely mapped read pairs. We only used valid read pairs for the draft genome re-correction and chromosome-level genome assembly. We further applied ALLHIC pipeline v0.9.1217 to link the contigs into 24 pseudo chromosomes. Finally, our team manually curated the Hi-C scaffolding based on the chromatin contact matrix in the Juicebox v1.11.0818. The scaffold assembly was obtained using the ALLHIC pipeline v0.9.1217 with 748.17 million uniquely mapped reads from cleaned Hi-C data (Table S6). A total of 2.80 Gb of sequences were anchored onto 24 pseudo chromosomes, accounting for 98.08% of the initial assembly (Figure S2, Table 2). In addition, Hi-C data were mapped against the Hi-C scaffold assembly, showing 74.31% valid sequences (Table S6). Genome-wide analysis of chromatin interactions showed a well-organized pattern of Hi-C signals along the diagonals, indicating a high-quality chromosomal genome assembly for L. oiostolus (Figure S2).

Table 2 Statistics of L. oiostolus genome sequence length (chromosome level).

Genome structure prediction and annotation

Both homology-based and de novo prediction methods were used to identify repetitive DNA elements. For the de novo prediction, RepeatModeler v1.0.1119 was used to construct an ab initio database of predicted repetitive elements, and RepeatMasker v4.0.920 was used to annotate the repetitive elements in the database. RepeatMasker and RepeatProteinMask were then used to search the genome sequence for known repetitive elements, with the genome sequences used as queries against the RepBase database v27.06 (http://www.girinst.org/repbase). Tandem repeats were also identified with the TRF method. Repetitive elements play an essential role in genome evolution. In the repeat annotation, a total of 1,423, 017,427 bp of transposon elements (TEs) comprise 49.84% of the L. oiostolus genome. Among all the classifications of TEs, LINE constituted the largest portion (Table 3, Figure S3).

Table 3 Transposable element (TEs) annotations in L. oiostolus.

Genome structure analysis was conducted using homology-based prediction, de novo prediction, and RNA-seq-based prediction. For homology-based prediction, the amino acid sequences from Aotus nancymaae, Callithrix jacchus, Carlito syrichta, Ochotona princeps, and Ochotona curzoniae were aligned to the L. oiostolus assembly by using Exonerate v2.4.021. For de novo gene prediction, Augustus v3.3.222,23, Genscan v1.024, and GlimmerHMM v3.0.425 were used to predict coding regions in the genome with internal gene models. We used Stringtie v2.1.126 to align assembled transcripts to the L. oiostolus genomic sequence and then TransDecoder v5.1.024 from the Trinity package to identify likely open reading frames within the transcripts. Finally, the gene sets obtained from the predictions of various methods were integrated using Maker v2.31.1025. After the characterization of repeat sequences, we used Maker v2.31.1025 to predict a consistent set of genes with 24,410 total genes annotated. Overall, 251 complete BUSCOs (98.4%), including 239 single-copy (93.7%) and 12 duplicate BUSCOs (4.7%), and one fragmented BUSCO (0.4%) were identified in the gene annotations (Table 4). Three genes (1.2%) were recognized as missing BUSCOs in our genome (Table 4).

Table 4 Summary statistics of predicted protein-coding genes.

Functional annotation of the predicted genes in the L. oiostolus genome was performed by alignment to the UniProt, Pfam, GO, KEGG, KEGG pathway, Interproscan, and NR databases using Diamond BLASTP v2.11.0 (parameter -evalue 1e-5)27 and KOBAS v3.028. Motifs, conserved sequences, and domains were annotated using InterProScan v5.33-72.029 and hmmscan v3.1 (parameter e-value 0.01)30. A total of 22,295 protein-encoding genes were functionally annotated in the L. oiostolus genome (Table S7).

Identification of non-coding RNA genes

Noncoding RNAs, including small nuclear RNAs (snRNA) and microRNAs (miRNA), were identified using INFERNAL v1.1.231 based on the Pfam database32. Predictions of tRNAs were generated using tRNAscan-SE v1.2333. The rRNAs and their subunits were predicted using RNAmmer v1.234. 463 miRNAs, 2,293 snRNAs, 1,053 tRNAs, and 541 rRNAs were predicted in the L. oiostolus genome (Table S8).

Data Records

The assembled genome has been deposited at DDBJ/ENA/GenBank under the accession JAWMBE00000000035. The raw reads were also deposited at Sequence Read Archive (SRA) at NCBI, respectively36, under Bioproject PRJNA1026309. Data of the gene functional annotations, and repeat annotation had been deposited at figshare37.

Technical Validation

The quality control of genome

The BUSCO v4.1.4 (parameters: -evalue 1e-05)38 was used to assess the completeness and accuracy of the assembled genome. For the BUSCO analysis, 96.2% of genes were completely recalled, 93.3% were single copies, and 2.9% originated from duplication events (Table S9).