Patterns of DNA variation between the autosomes, the X chromosome and the Y chromosome in Bos taurus genome

The new ARS-UCD1.2 assembly of the bovine genome has considerable improvements over the previous assembly and thus more accurate identification of patterns of genetic variation can be achieved with it. We explored differences in genetic variation between autosomes, the X chromosome, and the Y chromosome. In particular, variant densities, annotations, lengths (only for InDels), nucleotide divergence, and Tajima’s D statistics between chromosomes were considered. Whole-genome DNA sequences of 217 individuals representing different cattle breeds were examined. The analysis included the alignment to the new reference genome and variant identification. 23,655,295 SNPs and 3,758,781 InDels were detected. In contrast to autosomes, both sex chromosomes had negative values of Tajima’s D and lower nucleotide divergence. That implies a correlation between nucleotide diversity and recombination rate, which is obviously reduced for sex chromosomes. Moreover, the accumulation of nonsynonymous mutations on the Y chromosome could be associated with loss of recombination. Also, the relatively lower effective population size for sex chromosomes leads to a lower expected density of variants.

www.nature.com/scientificreports/ contig length (chromosome), we can see that BTA25 has the highest ratio (38.04), while the Y chromosome has the lowest ratio (4.76). It is also worth to mention, that the karyogram of cattle chromosomes reports BTA29 as the shortest chromosome. BTY contains the lowest number of genes (206), while chromosome 23 contains the highest number of genes (1,708). Clearly, chromosome length is not linearly related to the number of genes. Moreover, 29 autosomes are acrocentric, while both sex chromosomes (X and Y) are submetacentric. Since recombination that generates new combinations of alleles is characteristic to autosomes, on sex chromosomes, this phenomenon is reduced in males to only small homologous regions shared between BTX and -Y, called the pseudoautosomal regions (PARs). BTY is poorly characterized because it is difficult to sequence due to the occurrence of a high proportion of repetitive sequences 4 . Differences in genetic variation patterns arise from variable recombination and mutation rates, genetic drift, demography, selection, and population history. Therefore the focus of our study was on the comparison of patterns of genetic variation between autosomes, the X chromosome, and the Y chromosome in the context of the bovine genome.

Results
Alignment to the reference genome. The quality of alignment was expressed by the total percent of mapped reads and the percent of properly paired mapped reads (both reads are mapped close to each other in opposite directions on the same chromosome). The percent of mapped reads for each individual was very high and ranged from 91.4% to 99.9% with mean 99.7% ( ±0.6 ) and very similar mode 99.9%. The percent of properly paired mapped reads was also high and varied between 89.3% and 99.1% with mean 97.1% ( ±1.7 ) and mode 97.9%. Average genome coverage was calculated separately for each individual and ranged from 5.28 to 46.79 with mean and mode both equal to 25 (Fig. 1). Individuals average genome coverage less than 15, while two individuals have the average genome coverage above 40. The individual with the lowest average genome coverage (5.28) contains 99.6% of mapped reads and 97.8% of properly paired mapped reads. The individual with the highest average coverage (46.79) contain 99.9% of mapped reads and 97.9% of properly paired mapped reads.  (Fig. 2). The length of InDels varied between 1 and 281 bps on autosomes, from 1 to 233 bps on BTAX, and from 1 to 156 bps on BTY and is not uniformly distributed ( P < 0.001 ). The most frequently observed length of InDels was one bp, but the median length was two bps (Fig. 3). InDel median length differed significantly ( P < 0.001 )  www.nature.com/scientificreports/ between autosomes and BTX as well as between autosomes and BTY. 12%, 19%, and 98% of indels located respectively on autosomes, BTX and, BTY, was longer or equal 10bps.
Distribution of polymorphisms. The analysis of variant distribution was carried out by counting the number of variants within 100 kbp non-overlapping windows (Fig. 4). The number of SNPs per 100 kbp bin was heterogeneous across windows within the same chromosome ( P < 0.001 ). The highest number of SNPs in one window (16,184) was observed on BTA12. On the other hand, on four chromosomes we identified windows without SNPs: BTA8 (one window), BTA9 (eight windows), BTA10 (eight windows), and BTY (58 windows). All of the windows were annotated as intergenic. However, 27 out of the total 58 windows on BTY corresponded to gaps in the reference genome marked by stretches of Ns. Conversely, windows with the number of SNPs exceeding 10,000 were found on BTA4 (one window with 10,838 SNPs) 8% of window length overlapped with genes, BTA12 (four windows with 10,431, 11,487, 15,097 and 16,186 SNPs) in 3% of windows length overlapped with genes, and BTA23 (three windows with 11,706; 12,452 and 13,229 SNPs) in which 100% of windows length overlapped transcription regions. Interestingly, the functional annotation showed that four out of the eight SNP-rich windows overlapped with lncRNA genes. The highest overall number of InDels was identified on BTA23 (2,930 InDels). InDel density (Fig. 5) was not uniform across the genome ( P < 0.001 ). We identified 122 windows without InDels-seven on BTA9, eight on BTA10, and 107 on BTY. All the windows on BTA7 and BTA8 were intergenic. On BTY, 64 of the windows were intergenic, whereas 35 windows corresponded to gaps in the reference assembly. However, eight windows intersected 10 genes. 1% to 13% of these windows overlapped with the sequence of a gene. Windows harboring more than 2,000 InDels were found on BTA4 (one window with 2,162 InDels), BTA12 (two windows with 2,405 and 2,792 InDels), and BTA23 (three windows with 2,022; 2,258, and 2,930 InDels). Each of these windows www.nature.com/scientificreports/ overlapped with genes comprising from 33% of a window length to 100% of length, both on BTA23. As for SNPrich windows, three out of the six InDel-rich windows overlapped with lncRNA genes.

Variant annotation.
The annotation of variants expressed as the percent of polymorphic bases located in non-coding and coding regions in relation to their length was summarized in Fig. 6. For each chromosome, the proportion of number of polymorphic sites to the length of the specific region (coding or non-coding) was higher for coding regions in comparison to non-coding regions. The proportion of the number of variants www.nature.com/scientificreports/ overlapping coding regions to the total length of this region was the highest on autosomes (9.90 ± 3.20), while the lowest on BTY (1.83). The proportion of the number of variants overlapping non-coding regions to the total length of this region was also observed on autosomes (0.65 ± 0.09) with the lowest proportion on BTY (0.10).
Focusing only on genes, on autosomes and BTX we observed the highest percent of variants was located in introns (41.8% ± 2.9% and 42.6%, respectively) and in non-coding transcripts (41.7% ± 2.4% and 41.7%, respectively). The Y chromosome showed the highest percent of variants in regions annotated up to 5,000 bps upstream of the 3' end of gene start (25.3%) and up to 5,000 bps downstream of the 5' end of gene end (24.8%). It was accompanied by a proportion of intronic and non-coding transcript variants being by 20% lower than in BTA and BTX. Moreover, variants with high (splice acceptor, splice donor, stop gained, frameshift variant, stop lost, start lost) and moderate (missense) impacts on proteins occurred more frequently on BTY than on BTA and BTX, while the latter had the lowest percent of variants with high or moderate impacts. Interestingly, on BTY 0.11% of variants in transcription regions were frameshift, which is much higher than on BTA (0.02%) and www.nature.com/scientificreports/ BTX (0.01%). The same tendency was observed for stop gained variants, which constituted 0.03% of BTY, but only 0.004% of BTA and 0.003% of BTX. Nonsynonymous and missense variants were examined in order to predict whether an amino acid substitution affects protein function. For this purpose, the SIFT score was predicted. A variant with score 0 to 0.05 is considered deleterious while > 0.05 are considered tolerated. Figure 7 reports the distribution of SIFT scores among nonsynonymous and missense variants. We can see, that the distributions of SIFT scores for autosomes and the X chromosome are the same ( P = 0.76 ), while differences between the X chromosome and the Y chromosomes as well as between the autosomes and the Y chromosomes are different ( P < 0.001 ). The Y chromosome has the highest median of the SIFT score.
Population genetics. The ratio of nonsynonymous to synonymous SNPs ( K a /K s ratio) was the highest for BTY (2.00) and the lowest for BTX (0.62). Autosome averaged ratio was also below unity (0.79 ± 0.15) and varied between 0.57 for BTA22 and 1.10 for BTA4, BTA15, BTA18, and BTA23. Tajima's D statistic significantly differed between all three chromosome groups ( P < 0.001 ) and was positive for autosomes and negative for both sex chromosomes (Fig. 8). Nucleotide diversity had the highest median for autosomes, followed by BTX and BTY (Fig. 9). All differences in distributions of nucleotide divergence between chromosome groups were significant ( P < 0.001 ). Furthermore, windows with extremely high diversity and windows with extreme values of Tajima's D statistic were examined for their SNP density. These windows do not overlap regions with high SNP density.

Discussion
Our goal with this study was to compare patterns of genetic variation between autosomes, the BTX, and the BTY. A high percent of mapped reads, a high percent of properly mapped paired reads, and the fact that the most of individual genomes were sequenced with a genome average coverage exceeding twenty, demonstrated high quality of our data allowing for reliable inferences. Interestingly, individuals with low percent of mapped reads and low percent of properly mapped reads had high genome average coverage. This implies no correlation between genome average coverage and percent of mapped and properly mapped reads.
Altogether, over 23.6 million SNPs and over 3.7 million InDels were identified, among them 24% of SNPs and 94% of InDels were novel. Such proportion between novel SNPs and InDels can be caused by the fact that many more studies have been related to SNPs and thus the SNP database is more complete 5 . A larger number of SNPs than InDels is not an uncommon observation. InDels in transcription regions are highly deleterious to gene function as they can completely alter protein amino acid sequence by changing the open reading frame (frameshift mutation). BTY contains the lowest number of variants, which is not only due to its length. For instance, BTA25 is the shortest autosome, but it contains much higher number of variants. Taking into account the length of each chromosome, we observed that only 0.1% of the whole length of the BTY contained SNPs, where it is 0.45% for www.nature.com/scientificreports/  www.nature.com/scientificreports/ BTX. On the other hand, the shortest chromosome (BTA25) exhibited the highest SNP density of 2.58%. One of the factors affecting DNA variation is the effective population size ( N e ). Estimated based on autosomes, the effective population size is higher than based on sex chromosomes-3/4 of autosomal N e for X and 1/4 of autosomal N e for Y 6 . The lower N e , the more important role of genetic drift, so we expect that the effect of genetic drift is the highest for the Y chromosome, which furthermore implies lower neutral diversity of Y 7 . This observation contradicts the theory of male mutation bias that assumes that more mutations accumulate in male germline due to a greater number of male germline cell divisions, which is especially pronounced for BTY which "spends more time" in male germline than the other chromosomes 8 . A difference observed in our study is possibly due to gaps in the BTY chromosome assembly of the bovine genome (i.e. a high number of unknown nucleotides), which humper an accurate comparison of variant density between autosomes, BTX and BTY. However, more recent studies emphasize that the comparison of mutation patterns on a general, chromosomewide scale, is not valid due to a strong variation in local mutation rates 9,10 . Also in our study SNP and InDel distribution showed a non-random pattern, characterized by mutation hotspots with very high variant density, albeit only on autosomes. We observed that SNP density is positively correlated with indel density, which is in agreement with results obtained for humans by Hodgkinson et al. 11 . Estivill et al. 12 showed that regions with highdensity of SNPs are correlated with segmental duplications in the human genome. Varela and Amos 13 declared that regions with unusually high recombination rates tend to have a high density of SNPs, while Aggarwala and Voight 14 estimated the effect of DNA sequence k-mers on SNP probability.
Based on the annotation of transcription regions, it can be shown that there is a tendency that variants in transcription regions on BTX have less severe consequences as compared to Y and autosomes. Fewer extreme variants are consistent with purging due to the hemizygous state in males. On autosomes, these would most of the time be hidden in males due to diploidy. Based on Ka/Ks ratio, we also observed that on the BTY there is a tendency to accumulate more nonsynonymous than synonymous substitutions (Ka/Ks = 2.00). In contrast, BTX shows a much lower ratio (Ka/Ks = 0.62). Such differences might arise from the tendency for the degeneration of the Y chromosome-Haldane's rule, which indicates faster male evolution and reason of accumulation of nonsynonymous variants, mainly due to the lack of the recombination 15 . In males, BTX is hemizygous, so any, even slightly deleterious mutation has an effect on the phenotype. For mammalian genomes such situation was demonstrated in a simulation study by Mackiewicz et al. 16   www.nature.com/scientificreports/ Shorthorn (5 bulls and 5 cows) breeds available courtesy of the Center for Quantitative Genetics and Genomics at the Aarhus University. Animals were not closely related. There can be found some distant relationship within the breeds. Whole-genome DNA sequences were obtained by the Illumina HiSeq 2000 Next Generation Sequencing platform and similar short-read sequencing platforms. Access to this data was available via the computer cluster of the Center for Quantitative Genetics and Genomics at Aarhus University. ARS-UCD1.2_Btau5.0.1Y 1 reference genome was used to processing whole-genome sequence data. This genome represents the latest reference genome of Bos taurus additionally assembled with the Y chromosome (Btau5.0.1) from Baylor College 17 . In this study two GFF (General Feature Format) files were used for the annotation of identified variants. Btau_5.0.1 and ARS-UCD1.2 GFF were merged to obtain a complete annotation file for each chromosome. Those files were downloaded from the NCBI database (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/ GCF/002/263/795/GCF_002263795.1_ARS-UCD1.2/ and ftp://ftp.ncbi.nlm.nih.gov/ genomes/all/GCF/000/003/205/GCF_000003205.7_Btau_5.0.1/). The actual assembly file used was downloaded from https ://sites .ualbe rta.ca/~stoth ard/1000_bull_genom es/ARS-UCD1.2_Btau5 .0.1Y. fa.gz.

Material
Processing the next-generation sequencing data. The FastQC software was used to summarize and visualize sequence quality 18 . Sequence quality metrics measure the probability that a given base is called incorrectly. Low quality bases were trimmed from reads using Trimmomatic 19 with options SLIDINGWINDOW:3:15, LEADING:3, TRAILING:3, and a minimum read length of 70 bp. Cleaned reads were aligned to the reference genome by using the Burrows-Wheeler Aligner (BWA) with the MEM algorithm 20 . After alignment, the flagstat tool in SAMtools 21 was used to calculate the percent reads correctly mapped to the reference genome. The genome average coverage, representing the number of times that a base in the reference genome was covered by aligned reads 22 , was calculated for each individual, using the genomecov tool in bedtools 23 . It was expressed by coverage= R·L G , where R is the total number of aligned reads, L is average read length and G is the genome size. The output from BWA-MEM was piped to SAMtools 21 and then compressed to the BAM format (Binary Alignment Map)-a binary version of the SAM format. Afterwards, SAMtools fixmate 21 was run to adjust the materead position. BAM files were sorted using SAMtools 21 . PCR duplicates were marked using MarkDuplicates from Picard 24 . Finally, Base quality scores were recalibrated using Genome Analysis Toolkit (GATK) 25 .
Variant calling. Variant calling allows the identification of differences between analyzed sequences and the reference genome, i.e., of polymorphic sites. First, GATK's HaplotypeCaller 25 was used to create files (gVCF files) that summarize information on sites potentially deviating from the reference. Specifically, the tool identifies genomic regions, so-called ActiveRegions, which contain significant differences from the reference genome. Those regions are then processed by HaplotypeCaller. Other variations (non ActiveRegions) are skipped in order to accelerate the analysis. Afterward, ActiveRegions are used to construct haplotypes by building a De Bruijnlike graph 26 and calculate haplotype frequencies. Haplotypes were realigned against the reference haplotype using the Smith-Waterman algorithm 27 to identify potentially polymorphic sites. Then a matrix of likelihoods of haplotypes given the DNA sequence of reads was calculated using Hidden Markov Models. Thereafter, Haplo-typeCaller assigned the most likely genotypes 25 . The standard phred-scaled confidence threshold of 30 was used to remove the potential false-positive variants. The last step of variant calling comprised merging gVCF files representing different individuals. For this purpose, GATK GenotypeGVCFs was used 25 . This step resulted in raw variant call files (VCF), which contained summary information of each detected variant.
After variant calling, the identified polymorphisms were annotated with predicted biological consequences and functions. For each polymorphic variant the associated Sequence Ontology (SO) 28 terms, which categorize genomic functions of the coding sequence, were assigned. For this purpose, the snpEff software was used 29 . The program uses information included in the annotation file (GFF) to assign an annotation to each detected variant. In addition, we predicted effects of nonsynonymous and missense variants on protein functions using the SIFT score. For prediction of effect of amino acid substitution we used SIFT4G software 30 . Variants with SIFT score ≤ 0.05 were classified as deleterious, while > 0.05 were considered tolerated 31 .
Statistical analysis. The Shapiro-Wilk test. The Shapiro-Wilk 32 was used to test whether a sample was obtained from a normal distribution, comparing the following hypotheses: The corresponding test statistic is given by: where a i is the tabulated Shapiro-Wilk coefficient, x (i) is the ith smallest value of the Tajima's D/nucleotide divergence and x is the mean of Tajima's D/nucleotide divergence. We reject H 0 at the significance level ( α = 0.05 ) if W < W α , where W α is tabulated critical threshold for Shapiro-Wilk. www.nature.com/scientificreports/ The χ 2 goodness-of-fit test. The χ 2 goodness-of-fit test was used to check whether the observed variable in the general population follows the uniform distribution. Corresponding hypotheses were defined as follows 33 : The distribution of the observed variable is uniform H 1 : The distribution of the observed variable is not uniform The χ 2 goodness-of-fit statistic is given by 34 : where O i is the observed count of observations in the ith group, E i is the count of observations expected under the uniform distribution, and k is the number of groups. Under H 0 , the test follows the χ 2 distribution with k − 1 degrees of freedom. where N is the total sample size, k is the number of groups, R i is the sum of ranks in the ith group, and n i is the size of the ith group. Under H 0 , the distribution of this test can be approximmated by the χ 2 distribution with k − 1 degrees of freedom.
The Mann-Whitney U test. The Mann-Whitney U test is the non-parametric equivalent of the t-test for two independent samples with the corresponding hypotheses 37 : H 0 There is no difference between the two populations' medians H 1 There is a difference between the two populations' medians The statistic is given by 38,39 : where n is the size of the first group, m is the size of the second group, and R(X i ) is the rank assigned to the first group (Wilcoxon statistic). We reject the null hypothesis H 0 at the significance level α if t n,m,1− α 2 < U < t n,m, α 2 , where t n,m,1− α 2 and t n,m, α 2 are quantiles of the Mann-Whitney distribution. For large sample sizes,

12
, where N = n + m . This test was applied to test multiple, simultaneous hypotheses; therefore the Bonferroni correction was used to account for multiple testing 40 .
Genetic statistics. Nucleotide divergence. One of the statistic most widely used to measure the degree of polymorphism in a chromosome is a nucleotide divergence ( π ). It is a measure proposed by Nei and Li 41 . π quantifies the nucleotide diversity among several sequences. In our case, we estimated π along each chromosome in 100 kb non-overlapping windows. The estimator π of nucleotide divergence calculated based on information from VCF files is defined as: where k is the number of variants within a given 100 kb window, AN i is the total number of alleles in called genotypes of ith variant, w denotes the window size [100 kbs], p is the number of pairwise windows comparisons, and AC i is the allele count in genotypes of ith variant (for each alternative allele).
Tajima's D. Another statistic used in this study is Tajima's D. It allows for the detection of the evidence of selection. Tajima 42 proposed this statistic as a measure of a rate of a random (neutral) evolution of DNA sequence. In this study, Tajima's D was estimated over a 100 kbs non-overlapping windows as follows: www.nature.com/scientificreports/ where θ π is the average number of pairwise differences given by θ π = i<j d i,j n(n−1) 2 . Here, d i,j represents the number of differences between individual i and j, n = N 2 is the number of pairwise sequences comparisons with N being the number of individuals. θ W is Watterson's estimator of the expected number of segregating sites under neutral- , where S is the number of sites that segregate in the sample).
D > 0 indicates population reduction or balancing selection, D = 0 indicates neutral mutations, and D < 0 indicates population expansion or purifying and positive selection 43 . The VCFtools 44 was used to calculate the Tajima's D statistic.
K a /K s ratio. The K a /K s ratio was used as a measure of the selection pressure. Based on this ratio we compared natural selection pressure on proteins between chromosomes 45 . K a represents the ratio of the number of nonsynonymous substitutions per nonsynonymous site, while K s is the number of synonymous substitutions per synonymous site. K a /K s equal to one indicates a neutral selection, the positive ratio is equivalent to positive selection, while the negative ratio indicates purifying selection. The output of the SnpEff 29 software was used to calculate K a /K s ratio.
Computing environment. All programs and scripts were written in the bash command language and executed on the Genomics High Performance Cluster (GHPC) at the Center for Quantitative Genetics and Genomics at Aarhus University. The computing unit was the Red Hat Enterprise Linux Client release 4.8.5-28 (Centos) with 250 GiB/node of memory and 6 Intel Core Processor CPUs. All statistical analyzes were done using R (version 5.1) 46 in RStudio 47 and visualized using the ggplot2 R package 48 .

Data availability
The subset of the whole-genome data used in this study is available at SRP039339 under PRJNA238491. For the remaining data, the Board of the 1000 Bull Genome Project Consortium should be contacted. Whole-genome sequences from Aarhus University are available only upon agreement with the breeding organization and should be requested directly from the authors.