Population Structure Analysis of Bull Genomes of European and Western Ancestry

Since domestication, population bottlenecks, breed formation, and selective breeding have radically shaped the genealogy and genetics of Bos taurus. In turn, characterization of population structure among diverse bull (males of Bos taurus) genomes enables detailed assessment of genetic resources and origins. By analyzing 432 unrelated bull genomes from 13 breeds and 16 countries, we demonstrate genetic diversity and structural complexity among the European/Western cattle population. Importantly, we relaxed a strong assumption of discrete or admixed population, by adapting latent variable models for individual-specific allele frequencies that directly capture a wide range of complex structure from genome-wide genotypes. As measured by magnitude of differentiation, selection pressure on SNPs within genes is substantially greater than that on intergenic regions. Additionally, broad regions of chromosome 6 harboring largest genetic differentiation suggest positive selection underlying population structure. We carried out gene set analysis using SNP annotations to identify enriched functional categories such as energy-related processes and multiple development stages. Our population structure analysis of bull genomes can support genetic management strategies that capture structural complexity and promote sustainable genetic breadth.

To explore structural complexity, whole genome sequences of 432 selected samples were hierarchically clustered using Manhattan distances (Fig. 3, colored by 13 different breeds). Samples from the same breed do not necessarily appear together, although that does not imply whether breeds capture substantial and useful characteristics of bulls. Similarly, mutual k-nearest neighbour graphs (mkNNGs) were created by applying NetView 11,12 for k = 6 and 12, where samples from different breeds are clustered together ( Supplementary Fig. 2). Based on hierarchical clustering dendrogram and mkNNG clusters, it is evident that genetic structure may be more complex than breed codes. The dimension of the population structure in logistic factor analysis (LFA) was set at d = 7, as estimated by the VSS algorithm and the scree plot of decreasing eigenvalues ( Supplementary Fig. 3). The estimated logistic factors demonstrate the genetic continuum, reflecting shared origins of genetics and goals of breeding programs since domestication (Fig. 4). At the same time, the logistic factor 4 displays a clear distinction of Brown Swiss (from Switzerland, Germany, France, and Italy) and projection of logistic factors (LFs) allows straightforward visual identification of clusters ( Supplementary Fig. 4). We enable interactive exploration of this population structure by creating an online app visualizing LFs according to user-specified parameters (https://nnnn.shinyapps.io/ bullstructure/).
We discovered diverse and pervasive genetic differentiation with respect to the population structure of bulls. We found that the median and mean values of McFadden's pseudo R 2 (hereafter referred to as R 2 ) are 0.070 and 0.087, respectively (Fig. 5). Chromosome 6 contained substantially more SNPs with high R 2 than other chromosomes; it harbors 166 (39.0%) out of 426 SNPs with R 2 > 0.6, as well as all 29 (100%) SNPs with R 2 > 0.7. On the other hand, the X chromosome shows the least variation with respect to logistic factors, containing zero  Additionally, independent analyses were conducted to confirm robustness of our results. Particularly, we applied PCAdapt methodology on the same ~4.0 million SNPs, to identify SNPs under selection. In particular, after population structure is estimated by k = 6 PCs, communality statistics 13 or Mahalanobis distances 14 between each genomic variable and the top k PCs are used to detect local adaptation. Absolute correlation statistics between the top 6 LFs and the top 6 PCs were very high: 0.999, 0.894, 0.890, 0.994, 0.994, and 0.992 for each comparison between i th LF and i th PC for i = 1, … , 6. High concordance between the two methods can also be seen in a scatterplot of the top two PCs, compared to that of LFs ( Supplementary Fig. 5). The Spearman correlation between R 2 measures w.r.t. LFs and communality statistic w.r.t. PCs is 0.86, whereas that between R 2 and Mahalanobis distances is 0.68. It may suggest that our method using McFadden's pseudo R 2 is more similar to communality statistic than Mahalanobis distances. Overall, the results from PCAdapt robustly support cattle population structure and genetic differentiation identified using LFA and R 2 .
Among SNPs with the highest R 2 > 0.7, there exist two regions on chromosome 6; specifically 14 SNPs (13 within 50 kbp of known genomic features) positioned between 71101370 and 71600122 and 15 SNPs (11 within 50 kbp of known genomic features) positioned between 38482423 and 39140537. 83% of those most differentiated SNPs (20 out of 24 SNPs with known genomic features) are within or close to genes related to the selection sweep according to ref. 15. Among the first region, five SNPs fall within CHIC2 (ENSBTAG00000032660), while the closest features within 50 kbp also include GSX2 (ENSBTAG00000045812), U6 spliceosomal RNA (ENSBTAG00000042948), and novel pseudogene (ENSBTAG00000004082). U6 spliceosomal RNA (ENSBTAG00000042948) and novel pseudogene (ENSBTAG00000004082) are known to be associated with milk protein percentage 16 . In the second region, the exact overlaps occur in FAM184B (ENSBTAG00000005932), LCORL (ENSBTAG00000046561), and NCAPG (ENSBTAG00000021582). LCORL encodes a transcription factor whose human ortholog is involved in spermatogenesis, whereas NCAPG is crucial in mitosis and meiosis. Expecting much granular investigation of such genomic features, the list of 396,800 SNPs at the top 90 percentile (R 2 > 0.174) is available as Supplementary Data 2.
To better understand evolutionary and biological processes, we conducted gene set analyses using genomic annotations of SNPs. Firstly, we found that SNPs located within known genomic features have about 1.8% higher R 2 measures than intergenic SNPs without annotations (MWW p-value 9.85 × 10 −106 ; Bonferroni corrected p-value 2.46 × 10 −106 ). On the other hand, among intergenic SNPs, we found no significant correlation (p-value of 0.44) between SNP-feature distances and R 2 measures ( Supplementary Fig. 6). Secondly, among genic SNPs, R 2 measures corresponding to SNPs within exons are slightly higher than those within introns by 0.27% with a MWW p-value 3.89 × 10 −29 (Bonferroni corrected p-value 9.73 × 10 −28 ). Start/stop codons and 3′ /5′ UTR do not exhibit statistically significant difference from other genic SNPs. Lastly, we used 338 genes that are closest to SNPs with R 2 > 0.5 in the DAVID functional annotation tools. We found a total of 34 enriched annotation clusters, of which 11 clusters with enrichment scores > 0.5 are shown in Table 1. Biological processes and functions related to calcium-binding domain (cluster 1 and 9) and iron containing hemeproteins related to ATP (cluster 3 and 6) exhibit strong enrichment, potentially reflecting causes of population structure. Notably, we observed functional clusters for sexual, respiratory, and embryonic development (cluster 5, 7, and 10, respectively).

Discussion
Bos taurus has played a crucial role in ancient and modern societies alike by providing agricultural support and essential nutrients. Accurate characterization of its population structure helps conservation of genetic resources and optimal selection programs, ensuring a healthy and sustainable cattle population. In this process, we can better infer the genetic and functional variation that underlies the population structure. Using 432 samples from the 1000 Bull Genome Project, we provide a comprehensive sequencing-based assessment of population structure among cattle of European and Western ancestry. Assumptions underlying population structure and its estimation methods have evolved to address growing genomics data in terms of complexity and scales 10,[17][18][19] . Previous studies on genetic structure of cattle often model their samples as admixture of k ancestral populations. This critical choice of k depends on analytical solutions, such as log probability of data 18 , its rate of change 20 , or validation on independent test datasets (i.e., cross-validation) 21 . However, these methods may be sensitive to early divergence events or unable to capture hierarchical relationships 7 . Analysis of regional breeds often needs to include other published cattle genomes in order to estimate introgression or admixture 5,8,9 . This poses a significant challenge in population genomics.
We circumvent this challenge by using complementary methods that do not need to select k ancestral populations. Particularly, we utilize latent variable probabilistic models that can estimate a broad range of arbitrarily complex structure including admixture, continuous, and discrete population 10 . Some breeds are clearly distinguished by logistic factors (LFs), such as Brown Swiss by the fourth LF. However, LFs do not directly correspond to breeds or ancestral populations. To aid in comprehensively describing and exploring population structure from our analysis, we developed an interactive visualization app.
When modeling SNPs with logistic factors in generalized linear models, we found widespread genetic differentiation due to population structure. Despite making no assumption about structure, the majority of the most differentiated SNPs in our study have been identified as under selection sweep by previous studies. Chromosome 6, which harbors a large proportion of the highly differentiated SNPs, has been previously suggested to have been subjected to one or more selective sweeps 1 and has also been associated with a number of milk and beef production traits 22,23 . Interestingly, given that the novel pseudogene (ENSBTAG00000004082), which has been known to be associated with calving performance 24 and protein percentage 16 is strongly associated with population structure, we suspect that it plays a crucial functional role in cattle genomes.
Our genome-wide study of differentiation suggests stronger evolutionary pressure on genic regions. Prolonged changes in environment, driven by domestication and development of cattle breeds, have likely caused genetic differentiation that focuses on functional regions of genomes 25 . Furthermore, enrichment analysis of genome annotations provides strong indications that functional groups related to energy production and development stages underlie the genes that are highly differentiated with respect to population structure.
This study paves a way to further our understanding of population structure among modern European and Western cattle breeds. Identification of genetic differentiation with respect to population structure may inform conservation efforts to preserve heritage breeds and maintain genetic diversity. Methodologically, our sequencing-based analysis of population structure represents non-parametric approaches that can identify genetic differentiation and complexity without strong assumption on structure in population genomics.

Methods
Bull Genomes. The 1000 Bull Genomes Project has collaborated to gather whole-genome sequences of breeds from Australia, Austria, Canada, Denmark, Finland, France, Germany, Italy, Netherlands, New Zealand, Norway, Spain, Sweden, Switzerland, and United Kingdom. Its initial efforts have vastly expanded known single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in Bos taurus 2 . Currently, it covers 1577 bull samples as of version 5 released in 2015, among which 1507 and 70 bull genomes were sequenced with Illumina/Solexa and ABI SOLiD technology, respectively. For analysis of population structure, we selected unrelated bulls with average sequencing coverage greater than 5. Among sibs only one representative was selected randomly. SNP genotypes were identified prior to our study based on whole genome sequence data of bulls, using a multi-sample variant calling procedure. Polymorphisms with minor allele frequencies below 0.05 were removed To hierarchically cluster samples, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is applied to Manhattan distances 29 . When visualizing a resulting dendrogram, nodes are colored by breed codes. Alternatively, we applied netview to create mutual k-nearest neighbour graphs (mkNNGs) based on the same set of SNPs 11,12 . Unlike hierarchical clustering, mKNNGs assign discrete memberships, which are visualized in a force-directed graph (as implemented in netview).
To infer population structure directly from a genome-wide genotype matrix, we consider a probabilistic model of individual allele frequencies. In particular, by using logistic factor analysis 10 that captures systematic variation of individual-specific allele frequencies arising from discrete or continuous sub-population, spatial variation, admixture, and other structures, we relax statistical assumptions imposed on bulls by its official breed and country code defined in the animal registration ID. While the statistical models and algorithms are extensively described elsewhere 10 , we provide a brief overview of this approach here.
Consider a genotype matrix Y with m SNPs and n bulls. For each y ij , an individual-specific allele frequency for i th SNP and j th bull is f ij ∈ [0, 1]. This collection of parameters (a m × n F matrix) is transformed into real numbers via the logit function, which allows computation of the underlying latent structure. Overall, the statistical model considered is Then, the population structure is captured by d logistic factors (LFs) H which can be estimated by applying principal component analysis (PCA) to logit (F). Note that A is a matrix of coefficients in a logistic regression. The dimensions of logistic factors are estimated by comparing the observed correlation matrix to a series of hypothesized structures derived from selected variables of large loadings 30 . In the Very Simple Structure (VSS) algorithm, we considered d = 1, … , 100, while applying principal component analysis on the mean-centered genotypes (R package psych). Eigenvalues of m −1 Y T Y and percent variance explained by each component are visually inspected for the inflection point (e.g., elbow). For robustness analysis to confirm genetic differentiation, we alternatively used cross-validation approximations to choose d 31 .
To approximate how much of the variation in genotypes is explained by the population structure, we calculate McFadden's pseudo R 2 that is appropriate for a logistic regression 32 . For i th SNP, i null are maximum log-likelihoods of the full and null models, respectively. As this study only considers McFadden's pseudo R 2 in logistic regressions, we will henceforth refer to it as R 2 when clear in context. Significance analysis with respect to logistic factors (or principal components) are done with a resampling-based jackstraw method 33 .
Additionally, we performed genome-wide scan for selection in the panel of SNP data using PCAdapt 13,14,34 . Generally, PCAdapt uses Mahalanobis distances and communality statistics between SNPs and the first k principal components (PCs), with appropriate normalization specific to each measure. Selection is detected when SNPs (or other genetic markers) are substantially explained by the first k PCs 13,34 . To evaluate concordance of results from PCAdapt and LFA, we compute Spearman correlation between Mahalanobis/communality statistics using PCs and McFadden's pseudo R 2 measures using LFs.

Annotation and Enrichment.
For genome annotation, we used the latest Bos taurus reference genome from the Center for Bioinformatics and Computational Biology, University of Maryland (downloaded from the NCBI server ftp://ftp.ncbi.nlm.nih.gov/, version UMD3.1.83).
When testing whether the distribution of McFadden's pseudo R 2 measures are significantly different according to feature types, we used the Mann-Whitney-Wilcoxon (MWW) test 35 . With a large sample size, a Normal approximation is used to compute MWW p-values. In particular, we investigated whether SNPs falling within genes may have a higher McFadden's pseudo R 2 than those in intergenic regions. Among SNPs with known feature assignments, MWW tests were used to infer if a particular feature type is associated with significantly higher R 2 measures. Bonferroni correction is applied on a set of four MWW tests to adjust for multiple hypotheses testing 36,37 .
Lastly, because some of SNPs are in intergenic regions with no known annotations, we utilized the closest features function from . . BEDOPS v2 4 15 27 . Among the top genes with McFadden's pseudo R 2 > 0.5, we apply DAVID v6 7 . considering GO, KEGG pathways, InterPro, SwissProt Protein Information Resource, and other databases to identify enrichment of biological processes and functional pathways 38 . For intergenic SNPs, we Scientific RepoRts | 7:40688 | DOI: 10.1038/srep40688 searched the reference genome for the closest genes, which were used in . DAVID v6 7. When clustering functional annotations, we set "Classification Stringency" to high.