Long read genome assemblies complemented by single cell RNA-sequencing reveal genetic and cellular mechanisms underlying the adaptive evolution of yak

Wild yak (Bos mutus) and domestic yak (Bos grunniens) are adapted to high altitude environment and have ecological, economic, and cultural significances on the Qinghai-Tibetan Plateau (QTP). Currently, the genetic and cellular bases underlying adaptations of yak to extreme conditions remains elusive. In the present study, we assembled two chromosome-level genomes, one each for wild yak and domestic yak, and screened structural variants (SVs) through the long-read data of yak and taurine cattle. The results revealed that 6733 genes contained high-FST SVs. 127 genes carrying special type of SVs were differentially expressed in lungs of the taurine cattle and yak. We then constructed the first single-cell gene expression atlas of yak and taurine cattle lung tissues and identified a yak-specific endothelial cell subtype. By integrating SVs and single-cell transcriptome data, we revealed that the endothelial cells expressed the highest proportion of marker genes carrying high-FST SVs in taurine cattle lungs. Furthermore, we identified pathways which were related to the medial thickness and formation of elastic fibers in yak lungs. These findings provide new insights into the high-altitude adaptation of yak and have important implications for understanding the physiological and pathological responses of large mammals and humans to hypoxia.

species share genetic features of high-altitude adaptation and are considered excellent models for studying hypoxia tolerance in large mammals.
In 2012, the genome of a female domestic yak was sequenced using the Illumina-based whole-genome shotgun method 6 . In 2020, the genome of a female wild yak was assembled with the Illumina data 7 , and chromosome-scale genomes of female domestic yak were constructed using long-read sequencing and chromatin interaction technologies 10,11 . However, previous studies on yak have primarily focused on single nucleotide variants to reveal the genomic diversity and historical population dynamics 6,[12][13][14][15][16][17][18] . Recent studies showed that structural variants (SVs), such as insertions, deletions, duplications, inversions, and translocations, are widely present in genomes 19 and provide an extensive source of genetic variations for identifying candidate genes involved in the regulation of critical biological processes 20,21 . The availability of high-quality reference genomes for taurine cattle, which were obtained using long-read sequencing technologies, has enabled the dissection of genetic basis of complex traits 22 . Assembling high-quality and complete reference genomes of wild and domestic yak is fundamental for deciphering the molecular mechanisms underpinning adaptation of yak and related species to extreme high-altitude environment on the QTP. Unfortunately, due to quality-related issues of the wild yak reference genome, SVs and SVrelated genes in wild yak and domestic yak have yet to be mapped and compared with those in taurine cattle genome in detail.
Unique genomic features and precisely controlled gene expression ensure the physiological adaptation of animals to high altitude. Non-native mammals are prone to altitude sickness, mainly manifested as pulmonary hypertension and right ventricular hypertrophy after their exposure to acute or long-term hypoxia [23][24][25] . Yak acquired specific anatomical and physiological characteristics to survive the oxygenpoor air and the harsh environment. It is known that their blood, lung, and heart systems have been evolved to meet the challenge at high altitude 23 . Wild and domestic yak are adapted to hypoxia by increasing hemoglobin content, red blood cell count and hematocrit 26 . The lung and heart weights of yak are higher than those of age-matched taurine cattle 25 . Among these tissues, the lung is the interface between environment and body, therefore it plays a pivotal role in high-altitude adaptation. Yak developed intense pulmonary blood vessels, which increase the oxygen exchange rate of pulmonary artery blood vessels and help relieve pulmonary artery pressure 23 . By comparing transcriptomic profiles among different tissues across species, it was reported that the lung exhibited adaptive transcriptional changes and expressed a higher number of positively selected genes 27 . Despite these findings, the cellular components, and gene expression dynamics of lung tissues in animals that are adapted to high altitude remain to be explored at single cell level.
In the present study, we used Nanopore sequencing and Hi-C data to assemble the high-quality genomes of a wild yak and a domestic yak. We used an alignment-based strategy to compare long-read sequencing data with taurine cattle, to identify SVs associated with high altitude adaptation. We also constructed the single-cell atlas of yak and taurine cattle lungs using single-cell transcriptome sequencing (scRNA-seq). Integrated genomics and transcriptomics analyses revealed a new subtype of endothelial cells and uncovered a list of genes and pathways that were associated with the development of unique structures in yak lung. All together, these data provided important information to understand genetic and cellular mechanisms underlying the adaptive evolution of yak.

Chromosome level genomes of wild and domestic yak
DNAs from an adult male wild yak and an adult male domestic yak ( Supplementary Fig. 1) were extracted for sequencing and genome assembly. A total of 157.17 Gb data from wild yak and 191. 17 Gb data from domestic yak were generated using Nanopore long-reading sequencing technology and wtdbg2 program 28 (Table 1).
Next, we evaluated the sequence consistency and integrity of the two newly assembled genomes. The comparison rate of all small fragments was 99.35% and the coverage rate was 99.19% in wild yak while those for domestic yak were 99.18% and 99.59%, respectively (Supplementary Table 7), indicating high quality in their sequence consistency and integrity. Because the heterozygous SNP ratios were 0.1371% and 0.1590%, and the homozygous SNP ratio were 0.0009% and 0.0004% for the wild yak and domestic yak genomes, respectively (Supplementary Table 8), we concluded that our genome assemblies had a relatively high single-base accuracy. CEGMA assessment showed that 94.76% CEGs (Core Eukaryotic Genes) were assembled in both wild and domestic yak genomes (Supplementary Table 9). BUSCO evaluation indicated that 93.3% and 93.2% of the complete single-copy genes in wild and domestic yak, respectively, were assembled from 4104 orthologous single-copy genes (Supplementary Table 10). Merqury evaluation results indicated that the qv value of domestic yak was 30.1682, while that of wild yak was 30.5424. All these results indicated relatively completed assemblies of both wild and domestic yak genomes.

Annotation for repetitive sequences and genes in wild and domestic yak
The two genomes of wild yak and domestic yak were then annotated for repetitive sequences and genes. The predicted de novo repeat sequence library was integrated with the homologous repeat sequence database (Repbase), and then the genomes were annotated for repeats with RepeatMasker 29 1c). In order to further determine the SVs that likely participate in The ratio of SVs  high altitude adaptation, we calculated the F-statistics (FST) for all SVs between taurine cattle and yak, and selected SVs with the top 0.5% FST values (High-FST SVs) as candidate sites. Interestingly, we found that FST values of these candidate SVs were 1 (FST = 1) and this accounts for 12.5% of the SVs ( Supplementary Fig. 12). Among these high-FST SVs, 24,468 SVs are (10.14% of all SVs) in the intronic and 712, 18,483, 174, 136 and 4,187 SVs were in promoter, intergenic, exonic, UTR or the 150 bp upstream and downstream flanking regions of genes, respectively (Fig. 1d). Of these, 11 genes contained deletions causing frameshifts in the ORF (Supplementary Table 22). We then annotated the 6733 genes carrying high-FST SVs and 897 genes carrying special SVs located in the exonic and promoter regions (Supplementary Data 1).
Integrating transcriptome and chromatin accessibility data to reveal potentially functional SVs Many SVs lie in the noncoding sequences, including promoter and UTR regions, and they are associated with the expression of nearby genes.
To interrogate the potential roles of SVs in influencing gene expression, we examined differentially expressed genes (DEGs) between six domestic yak and taurine cattle organs including heart, lung, kidney, spleen, muscle, and liver, using RNA-sequencing data from 48 samples (Supplementary Table 23). It should be noted that we did not find SVs in the yak-specific gene (Supplementary Table 24 . We next conducted the Chi-square test for detecting relationships between the DEGs and SV-carrying genes in six tissues, and the results suggested that the identified SVs were associated with the differential expression of genes in all tissues analyzed (All p values were <0.01; χ 2 value was from 9.15 in the heart to 41.14 in the lung) ( Fig. 2a  The number of DEGs in lung, heart, spleen, kidney, muscle, and liver between taurine cattle (n = 3 biologically independent samples) and yak (n = 4 biologically independent samples) with high-FST SVs. 5 cattle and 5 yak lung tissue transcriptome data were used for analysis. Different letters (a, b, c) denotes statistical significance at P < 0.05 (One-way ANOVA with Tukey's test for multiple comparisons; F = 1722; P < 0.0001 Genelist annotated with high-FST SVs located in exon and promoter. List3: Differentially expressed genes between domestic yak and taurine cattle in lungs. c The motif enrichment results of peak located in the promoter with high-FST SVs. The p values were calculated using Fisher's exact test. d The sequence information of five special motifs. Source data are provided as a Source Data file. genomic variation data, we discovered that 127 genes carrying high-FST SVs within the exonic and promoter regions were differentially expressed in lungs of the taurine cattle and domestic yak ( Fig. 2b; Supplementary Table 25). In addition, we performed motif enrichment analysis on the peak of the promoter regions carrying high-FST SVs using the MEME suite. As a result, 632 peaks were identified in the promoter regions carrying high-FST SVs and 446 motifs were obtained after enrichment analysis ( Fig. 2c; Supplementary Data 4). Particularly, 5 transcription factors that were previously identified to be important for the hypoxia response were enriched (Fig. 2d). ARNT participates in the hypoxia-inducible factor (HIF) signaling pathway 35 . GATA1 directs the expression of genes involved in heme biosynthesis, erythropoietin signaling, and anti-apoptotic and proliferation pathways, and is also required for EPOR expression 36 . MAFG interacts with HIF-1α and controls hypoxia response 37 . KLF5 is crucial for hypoxia-induced vascular remodeling and its expression is directly regulated by HIF-1α 38 . HOXB5 controls airway and alveolar development through cell-cell communication between the mesenchyme and epithelial cell compartments 39 .
Together, these data revealed that the SVs in the promoters of hypoxia-related genes have potential functions in gene expression regulation.

Single cell RNA-seq analysis of lung tissues of domestic yak and taurine cattle
Although the lung is one of the most important organs mediating physiological adaptations to hypoxia, the diversity of cell populations and gene expression profiles at the single cell level has not been studied in lungs of high-altitude animals. To that end, we digested the domestic yak (n = 5) and taurine cattle lung tissues (n = 5) and obtained single cells for RNA-seq analysis ( Supplementary Fig. 13). After two library constructions, a total of 31579 domestic yak cells and 27951 taurine cattle cells were harvested and sequenced. After cell filtration and integration of two samples, we used Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction strategy 40  Similarly, these four cell types of epithelial cells (8.6%, expressing SEC14L3, SCGB3A2, and SFTPC), endothelial cells (5.3%, expressing MMRN1, CALCRL, and CCL21), mesenchymal cells (4.5%, expressing COL1A1, COL3A1, and DCN), and immune cells (81.6%, expressing CD3E, CCL5, and IL1RN) were also present in the taurine cattle lung ( Fig. 3a; Supplementary Fig. 15). These four cell types were present in mouse lung, suggesting that the major cell components of lung tissues were conserved among species 41 . SFTPC and SCGB3A2, the two genes that were specifically expressed in lung tissues, exhibited their enrichments in yak15 and yak12 cell clusters (epithelial cells) (Fig. 3b). EPAS1 gene, which is related to high altitude adaptation 42 , was enriched in yak13 clusters (endothelial cells) (Fig. 3b). Furthermore, we uncovered new markers for all cell types, for instance, CYP2B6 and TPPP3 in epithelial cells, EPAS1 in endothelial cells, COL3A1 in mesenchymal cells, and MS4A1, and BRB in immune cells ( Supplementary Figs. 14 and 15). Based on the results of the Pearson correlation coefficients 43,44 , we concluded that the relatively correlations were weak between the yak16, a type of endothelial cell, and all the 20 clusters of taurine cattle cells (the r-value of each cell subgroup of the cattle <0.31) ( Fig. 3c; Supplementary Data 5). The Yak16 cluster likely corresponded to a new population of cells that highly expressed TXNDC5, TENM1, and ZCCHC14 which contained high-FST intronic SVs ( Fig. 3d; Supplementary Data 1). TXNDC5 (Thioredoxin Domain Containing 5) promotes fibroblast activation, proliferation, and excessive ECM (extracellular matrix) production by enhancing TGFβ signaling activity through post-translational stabilization and upregulation of Tgfbr1 in lung fibroblasts 45 (Fig. 3d). TENM1 (Teneurin Transmembrane Protein 1), which is involved in neural development 46 (Fig. 3d). ZCCHC14 (zinc finger, CCHC domain containing) deletion in mice causes microcephaly, intellectual disability, bilateral vesicoureteral reflux, and bulbar venous malformations 47 (Fig. 3d; Supplementary Fig. 16).

Identification of gene expression patterns and cellular components in domestic yak and taurine cattle lungs
In addition to the differences in cell components, we proposed that differential gene expressions in the same cell population may also contribute to the adaptation of yak. We found that mesenchymal cells (yak-11 and cattle-12&13 clusters) had 1024 DEGs (Supplementary Data 6). Gene enrichment analysis of these DEGs identified pathways of blood vessel development and morphogenesis ( Fig. 4a; Supplementary Data 7). Blood vessel development and angiogenesis in lung are crucial for the establishment of adaptive mechanisms in hypoxic environment 23 . The LOX gene, which is localized at the elastin/microfibril interface in aorta and cross-links both elastin and collagen, and involved in elastic fiber assembly 48 , exhibited enrichment in domestic yak mesenchymal cells (Fig. 4b). Indeed, histological examination of lung tissues using hematoxylin and eosins (H&E) staining uncovered that the density of pulmonary alveoli and relative mean single alveolar area did not differ, however, the medial thickness of micro-vessels was significantly thicker in domestic yak than taurine cattle, which was likely related to difference in blood vessel development and mesenchymal cell proliferation between two species (T-test; P = 0.0031; two-tailed; t = 4.767; df = 8) (Fig. 4c-e). Importantly, quantitative analysis using Gomori staining revealed that elastic fiber content was significantly higher in domestic yak than taurine cattle (T-test; P = 0.0084; twotailed; t = 3.860; df = 8) (Fig. 4c-e).
To examine the differences in cellular components and gene expression features, we used the canonical correlation analysis (CCA) to perform an integrated analysis of single-cell RNA-seq data from domestic yak and taurine cattle 49,50 . A further UMAP analysis identified six cell clusters at the resolution of 0.1 (Fig. 5a). Four major cell types were identified, including epithelial cells (n = 1639, expressing SEC14L3 and CYP2B6), endothelial cells (n = 1095, expressing MMRN1 and CCL21), immune cells (n = 16,255, expressing PLEK and IL7R), and mesenchymal cells (n = 737, expressing DCN and COL1A1) (Fig. 5b). Out of them, cluster 11 constituted 71.44 % of all cells in domestic yak while cluster 6 contained 68.46% of all cells in taurine cattle, both were endothelial cells (Fig. 5c).
Integrating genomic variants and single-cell RNA-sequencing data to reveal genetic underpinning of adaptation in domestic yak We then constructed plausible scenarios of causal relationships between genomic variants and development of novel cell clusters. We examined the enrichment of genes harboring high-FST SVs, DEGs, and DEGs with high-FST SVs in each sub-category of the domestic yak and taurine cattle lung cells. The T-test results showed that the cell subgroup expressed the highest proportion of marker genes carrying high FST-SV were endothelial cells in the lung of taurine cattle (Cattle6 clusters, expressing EPAS1 and ADGRF5; t = 2.347, two-tailed, n = 40 and P = 0.0243) (Fig. 6a-d)   domestic yak (Yak5, expressing THBS1 and CXCL3; T-test, t = 2.199, twotailed, df = 42 and P < 0.05) (Fig. 6a-d).
Finally, we compared the proportion of SV-carrying genes, DEGs, and SV-carrying DEGs in marker genes of four cell types between domestic yak and taurine cattle ( Supplementary Fig. 17). The results showed that mesenchymal cells (T-test; P = 0.0167; two-tailed; t = 3.955; df = 6) and the epithelial cells (T-test; P = 0.0122; two-tailed; t = 3.220; df = 10) of yak expressed more genes harboring high-FST SVs than those of cattle, however, the endothelial cells (T-test; P = 0.0091; two-tailed; t = 3.422; df = 10) of cattle expressed more DEGs than that of yak. Together, outcomes of these analyses suggested that the detected SVs were associated with differentially gene expression in different subsets of lung cells and likely had crucial roles in high altitude adaptation of yak.

Discussion
Next-generation sequencing is a powerful tool for studying genomic variations, however, the limitation of the genome assembly using short reads makes it impossible to detect SVs thoroughly and accurately 30,31 . In contrast, Nanopore (Oxford Nanopore Technologies) has the advantage of long reads and has been proven to be effective in solving complex genomic structures 31,32 . In addition, the application of Hi-C technology provides a complementary approach for genome assembly from scratch. In the present study, with the aid of Nanopore sequencing and Hi-C data, two long-read yak genome assemblies of wild and domestic yak were constructed and released.
Although a few yak genomes have been assembled previously compared to the assembly quality of cattle 22 , cattle-yak hybrid 51 , water buffalo 52 , goat 53 , and pig genomes 54 , the quality of yak genomes needs to be improved. In this study, the lengths of contig and scaffold N50 of wild yak were 38.28 Mb and 103.90 Mb while the domestic yak assembly reached to 44.91 Mb and 104.02 Mb, respectively. The N50 estimates of these two new genomes were comparable to those of the taurine cattle genome (ARS-UCD1.2, the contig N50 at 25.89 Mb, and the scaffold N50 at 103 Mb) but better than those of existing reference genomes of wild yak (the contig N50 was 63.2 kb and the scaffold N50 was 16.3 Mb) and domestic yak (BosGru_PB_v1.0, the contig N50 at 14.74 Mb and the scaffold N50 at 101.61 Mb) 7,11 . Verified by both BUSCO and CEGMA, the integrities of these two genomes were of high quality, therefore this work will facilitate future investigation of yak genetics and genomics.
Previous studies have identified some candidate genes related to high altitude adaptation of yak, such as positively selected genes of ADAM17, ARG2, and MMP3 by the branch-site likelihood ratio test 6 . Using a comparative transcriptomic analysis approach, it was uncovered that the heart and lung tissues exhibited the largest differences in gene expression profiles among the four organs examined (heart, liver, kidney, and lung) between yak and taurine cattle 55 . Based on secondgeneration short-read data, these studies provided important information on identifying candidate genes, however, the lack of highquality genomes has hindered the efforts for assessing the impact of SVs in the yak genomes. SVs can affect the traits of animals, such as  white or white-spotted pigs 56 and white band in taurine cattle 57 , and thus provide an extensive source of genetic variations for identifying candidate genes in important biological processes. A recent study performed SV detection and comparison between domestic yak and wild yak 10 , and uncovered that 3680 SVs were under artificial selection. Importantly, more than 700 genes carrying high-FST SVs were involved in nervous system development, neuron differentiation, and behavior, suggesting possible roles of the SVs in yak domestication.
Although the present study also annotated the SVs in domestic yak and wild yak, we did not describe the additional SVs in yak domestication. The focus of this work was to identify a list of candidate genes carrying high FST SVs by screening genomes of taurine cattle and yak and to further illustrate the potential roles of these genes in yak adaptation. We discovered different types of the SVs including deletions, insertions, duplications, and inversions. A limitation was that we could not select enough number of translocations for detailed analysis using current approach, although such variants were proven to be crucial for regulating gene expression and determining phenotype. For example, white-sidedness in cattle is due to serial translocation events involving KIT locus 57 . The current work primarily examined and described the SVs in exonic regions that affect gene expression, and noncoding SVs may have important consequences on influencing the expression of nearby genes and causes phenotypic variation 58 . These genes may have played crucial roles in directing the development of hypoxia tolerance at the cellular and physiological levels.
Single cell RNA-seq analyses revealed the cell components and developmental trajectories of lungs in humans, mice, and other animals. Single-cell RNA-seq data from neonatal mouse lungs suggested distinct populations of epithelial, endothelial, mesenchymal, and immune cells 41 . Single-cell map of human lungs identified 17 molecular types that either have been gained or lost during evolution or displayed substantially altered expression profiles 59 . The scRNA-seq of adult mammalian lungs, including humans, mouse, rats and pigs, revealed alveolar type I cells to play a major role in the regulation of tissue homeostasis 60 . We constructed the lung cell map of ruminants and found a unique endothelial cell population in the domestic yak lung tissue. Based on the single-cell transcriptomic and histological data, we provided candidate genes that may be related to the development of unique structure of domestic yak, including more elastic fibers and medial thickness of micro-vessels.
In summary, we present two high-quality chromosome-level genomes of wild and domestic yak. We also provided the genome-wide catalogue of SVs in yak and the single-cell transcriptomic profiles of the bovine lung tissues. These high-quality genomes and single-cell RNA-seq data serve an important source for future research on bovine species.

Animals
Genomic DNAs were extracted using a standard phenol-chloroform method 8 from the blood of an adult male wild yak and an adult male domestic yak that lived above 4200 m in Qumalai County of Qinghai Province, China. Animal experiments were approved by the Animal Ethics and Welfare Committee at Northwest Institute of Plateau Biology, Chinese Academy of Sciences.

Genome sequencing
Nanopore: After assessing the quality of the DNA, we constructed a Nanopore library with an insert size of 20 kb and utilized the Pro-methION platform to perform long-read sequencing. Hi-C: the wild yak and domestic yak blood samples were fixed in 37% formaldehyde solution. The nuclear chromatin was obtained from the fixed samples and digested using a 4-cutter restriction enzyme MboI (New England Biolabs, USA). The overhangs resulting from digestion were blunted by biotin-14-dCTP (Invitrogen, USA) and blunt-end ligation of the crosslinked fragments. Then, the genomic DNA was extracted by using the phenol-chloroform method. Biotin was removed from non-ligated fragment ends using T4 DNA polymerase (Thermo Scientific, USA).
Ends of sheared fragments by sonication were repaired by the mixture of T4 DNA polymerase (Thermo Scientific, USA), T4 polynucleotide kinase (Thermo Scientific, USA), and Klenow DNA polymerase (Invitrogen, USA). Biotin-labeled Hi-C samples were specifically enriched by using streptavidin C1 magnetic beads (Invitrogen, USA). After adding A-tails to the fragment ends and following ligation by the illumina

Estimation of genome size
The k-mer algorithm was applied to evaluate the domestic yak and wild yak genome size 61 . The 17 k-mer and 171.32 Gb next-generation sequencing reads were utilized for domestic yak, while 21 k-mer and 343.13 Gb next-generation sequencing reads for wild yak.

Genome annotation
Repeat annotation: De novo and homology approaches were combined to identify repetitive sequences in the wild and domestic genomes. Firstly, we constructed a de novo repeat library using RepeatModeler 65 (v2.0.1) with the default settings. Secondly, RepeatMasker 29 (v4.1.0) was run on the wild and domestic yak genomes using the de novo library. RepeatMasker 29 and its in-house scripts (RepeatProteinMask) was also run against the RepBase for homologous repeat identification. Gene structure annotation: Homology-based prediction, ab initio prediction, and RNA-seq assisted prediction were used to annotate all potential genes. (1) Homolog prediction: Sequences of homologous proteins of six related ruminant species, including buffalo (Buba-lus_bubalis), cattle (Bos_taurus), sheep (Ovis_aries), goat (Capra_hircus), European bison (Bison_bonasus) and reindeer (Rangifer_tarandus), were downloaded from Ensembl/NCBI/Gigadb database. These protein sequences were aligned to the wild and domestic yak genomes using TblastN 66  3) was subsequently used to predict gene models with default parameters. All predicted genes from the three approaches were integrated with EVidenceModeler (EVM) 73 (v1.1.1) to generate high-confidence gene sets.
Gene function annotation: Gene functions were assigned according to the best match by aligning the protein sequences using Blastp 74 (v2.2.26, E-value <10 −5 ) to protein databases including the SwissProt, Nr, Pfam, KEGG, and InterPro. The best hits were used to assign homology-based gene functions. Functional classification based on GO categories and InterPro entries was achieved using the InterProScan (v5.35).

Structural variant analysis
The genome of a female taurine cattle (ARS-UCD1.2) and the Y chromosome of other male taurine cattle (Btau_4.6.1) were downloaded and merged into the taurine cattle genome file. For Nanopore sequenced data, we aligned each sample to ARS-UCD1.2 using NGMLR (v0.2.7) with the option '-x ont', while without 'the option '-x ont'for Pacbio sequenced data. Alignments were sorted and indexed by samtools (v1.8). The program cuteSV (v1.0.11) was used to call SVs for each individual with important parameters 1 (-max_cluster_bias_INS 100-diff_ratio_merging_INS 0.3-max_cluster_bias_DEL 100-diff_ratio_ merging_DEL 0.3) for Nanopore data and important parameters 2 (-max_cluster_bias_INS 100-diff_ratio_merging_INS 0.3-max_cluster_ bias_DEL 200-diff_ratio_merging_DEL 0.5) for pacbio data. We merged all VCF files to obtain a total SV data set by SURVIVOR (v1.0.7) with parameters '50 1 1 −1 −1 −1'. cuteSV with option '-genotype' was used again to perform genotyping for each sample at all SV breakpoints based on the merged VCF file and all mapping results. Then we performed a filtration to remove unreliable SVs using vcftools (v0.1.17) with parameters (-max-missing 0.5-mac 3-minQ 30). The python script was used to calculate the number of SVs on each chromosome, the proportion of different SVs, the size distribution of SVs, the chromosome distribution of different SVs, and the sample distribution of different SVs. The filtered SVs were annotated used VEP (Web version).
We verified the selected SVs through PCR. DNAs were extracted from testes of three male taurine cattle and three male domestic yak using a standard phenol-chloroform method 8 in Xining County of Qinghai Province, China. The regions of selected SVs were amplified using the primers (Supplementary Table 26). The PCR program consisted of an initial denaturing step at 95°C for 4 min, 35 amplification cycles (95°C for 55 s, 60°C for 55 s, and 72°C for 55 s), and a final extension at 72°C for 5 min in a thermocycler (Eppendorf). a The heatmap shows the normalized ratio of SVgene, DEGs and SVDEGs in the marker genes of each cluster in the taurine cattle lung (n = 5 biologically independent samples) (Note: *P = 0.0243; **P = 0.0038; ***P = 0.0007; T-test; two-tailed). b The heatmap shows the normalized ratio of SVgene, DEGs, and SVDEGs in the marker genes of each cluster in the domestic yak lung (n = 5 biologically independent samples) (Note: *P = 0.0337; T-test; two-tailed). c The bubble chart shows the overlap significance of the three gene lists and the marker genes of each cluster in the taurine cattle (n = 5 biologically independent samples) lung respectively (Fisher's exact test). d The bubble chart shows the overlap significance of the three gene lists and the marker genes of each cluster in the domestic yak lung (n = 5 biologically independent samples) (Fisher's exact test). SVgene: genes harboring high-FST SVs; DEGs: differentially expressed genes between the domestic yak and the taurine cattle lung; SVDEGs: DEGs harboring high-FST SVs (Note: significance, P < 0.05; Confidence interval = 95%; Df = 1; two-sided). Source data are provided as a Source Data file.
Identification and enrichment analysis of differentially expressed genes The transcriptomic data of yak and taurine cattle were retrieved (Supplementary Table 23). The clean data were aligned to the reference genome (ARS-UCD1.2) using hisat2 (v2.2.1). After alignments were sorted and indexed by samtools, counts are obtained through the featureCounts program with parameters (-p -t exon -g transcript_id). DESeq2 75 package (v1.30.1) was used to analyze the differentially expressed transcript (DET) between the taurine cattle and yak lungs. The R script is used to calculate the p-value of each transcript by the ttest, and DETs are screened by p-value <0.05 and |log2FoldChange| > = 1. Gene ID conversion and enrichment analysis were performed through the g:Profiler site 76 . Miropeats 77 (v2.02) was used to visualize the SVs in specific genes (TXNDC5, TENM1 and ZCHHC14). Gene expression histograms were drawn using Graphpad prism software (v8.0). The Veen diagrams of differentially expressed genes (DEGs) and the genes with SVs were drawn. The one-way ANOVA and multiple comparisons was done with the GraphPad Prism software.

Analysis of Yak-specific gene expression
Orthofinder software (v2.5.4), domestic yak, and cattle (ARS-UCD1.2) amino acid files were used to calculate yak-specific genes 78 . Transcriptome data from yak lung were used to realign to the domestic yak reference genome using hisat2, and then the count value was calculated using the featureCounts program, and finally the normalized count value was calculated using the DEseq2 package.

Single-cell transcriptome sequencing
The lung tissue samples were collected from 5 adult male taurine cattle in Zhangye (Gansu province, China) and 5 adult male domestic yak in Xining (Qinghai province, China). The samples were transported to the lab on ice and digested with 5 ml 1 mg/ml collagenase for 20 min until the lung tissues were digested into single cells. After adherent cells were discarded using a 40 µm cell filter, the single cell suspension was centrifuged with 400 g for 5 min and then suspended in 5 ml Dulbecco's phosphate-buffered saline (DPBS) containing 10% FBS (DPBS + S). Red blood cells were removed using Solarbio erythrocyte lysis buffer (3:1; Beijing, China). After centrifugation, sample buffer (BD Biosciences, New Jersey, USA) was added for single cell RNA sequence determination. BD Rhapsody TM Single Cell Analysis System (DOCID: 210966 Rev. 1.0 protocol) was used for single-cell cDNA synthesis. The library preparation was performed according to the BD Rhapsody TM Whole Transcriptome Analysis (WTA) Amplification Kit and BD AbSeq library preparation protocol. In particular, we used three taurine cattle samples to build the library, and two taurine cattle samples for the second time. For the five yak samples, we adopted the same method to build the library. Double-ended sequencing of 150 bp was performed on the IIIumina HiSeq 2000 platform by Novogene (Beijing, China).

Single-cell transcriptome data analysis
Quality control and analysis of the original data were performed according to BD Phapsody pipeline. STAR (v2.7.1a) was used to index the reference genome (Taurine cattle: the same as the reference genome of sv calling; Yak: the newly assembled domestic yak reference genome in this study) 79 . After quality filtering, reference gene comparison, expression quantification, and a cell-gene expression matrix file were generated following BD pipeline. The expression matrix obtained from the first and the second are integrated by CCA and then was processed by cluster analysis using Seurat toolkit 50 (v3.2.0) with following protocols: (1) Quality control and cell filtration: Low-quality cells were filtered according to the expression of genes. The filtering standards were <300 and <500 detected (transcript count > 0) genes for the taurine cattle and yak lungs. >20% of transcript counts were mapped to mitochondrial genes for the lungs of both species; (2) Data standardization and noise reduction: The default global normalization method "LogNormalize" of Seurat toolkit was used to standardize the gene expression matrix; (3) Cell clustering: Principal Components Analysis (PCA) was performed, then the most significant principal components were evaluated, and the first 20 principal components were selected for cluster analysis. According to the clustering results, the UMAP method 40 was used to visualize the distribution of cells in two-dimensional space, and the cells of the same cluster were represented by the same color (important parameter: dims = 1:20, check_duplicates = FALSE); (4) Search and visualization of marker genes: Seurat toolkit was used to calculate each marker gene and its expression level in each cluster with default parameters and filter marker genes with special parameters (p_val_adj <0.05 and avg_logFC > 1). Special marker genes were displayed in violin map using Seurat toolkit and the heatmap was generated using ComplexHeatmap package 80 (v2.6.2). (5) Differential expression analysis: Use FindMarkers function to detect differentially expressed genes in mesenchymal cells of cattle and yak with special parameters "min.pct=0.25". (6) Functional enrichment analysis: g:Profiler web server was used to analyze the GO functional and KEGG pathway enrichments of each set of differentially and highly expressed genes. The ggplot2 package (v3.3.3) was used to draw bar charts for visualization of the results. (7) We computed the Pearson correlation coefficient between the yak and taurine cattle lungs using corr.test function in psych package (v2.0.8).

Two-sample integrative analysis
The two expression matrices filtered in single sample analysis were used for two-sample integrative analysis. The FindIntegrationAnchors function in the Seurat toolkit was applied to identify the anchors which were qualified for the IntegrateData function in the Seurat toolkit, and the integrated data were returned to a Seurat toolkit object, which contained a new Assay (integrated), which stored the integrated expression matrix (Important parameters: dims = 1:30). The UMAP method 40 was used for cell clustering (Important parameters: FindClusters (resolution = 0.1)); RunUMAP (reduction = "pca", dims = 1:30). The Seurat toolkit was used to calculate each marker gene and its expression level in each cluster (important parameters: FindMarkers (default parameters); and filter (p_val_adj <0.05 and avg_logFC > 1)). The ComplexHeatmap package was used for visualization of the marker genes in a heat map. The ggplot2 package and the Seurat toolkit were used to draw a histogram and a scatter diagram.
Integrated analysis of scRNA-seq, genes with SVs, DEGs, and DEGs with SVs Four gene lists including (1) The genes with high-FST SVs; (2) The DEGs between the taurine cattle and yak lungs; (3) The intersection of the first two genes; and (4) The marker gene of each cluster were used for integrated analysis. The proportions of the first three gene sets in each cluster (marker gene set) were calculated and normalized. The chisquare test was done with the stats R package. The p-value and number were determined using ggplot2 package. The T-test were done with the GraphPad Prism software.

Elastic fiber and Hematoxylin & Eosin (HE) staining
Lungs tissues from taurine cattle and yak were fixed in 4% paraformaldehyde and cut in serial paraffin sections (5-7 μm in thickness). Sections were processed for HE staining or elastic fiber staining. Elastic fibers were stained in aldehyde fuchsin solution for 10 min. Digital images were captured with a microscope (Nikon ELIPSE E200, Japan). All quantitative data were counted for at least three independent biological replicates. The two-tailed t-test analysis function of the GraphPad Prism software was used to statistically determine the difference between the means and the significance was set to P < 0.05.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
We have deposited the assembled genome, raw Hic data, and raw Nanopore data of wild yak and domestic yak at Sequence Read Archive (SRA) database of National Center for Biotechnology Information (NCBI) database with accession BioProject codes: PRJNA720245 and PRJNA720246. And the lung single-cell RNA-Sequencing data of domestic yak and taurine cattle have been at the NCBI database with accession BioProject codes: PRJNA720247 and PRJNA720248. Source data for Figs. 1a, c, d, 2a, c, 3c, 4d, e, 5c, 6a-d, Supplementary Fig. 11, Supplementary Fig. 16, and Supplementary Fig. 17 are provided with this paper. Source data are provided with this paper.