Genome-wide scan for selection signatures reveals novel insights into the adaptive capacity in local North African cattle

Natural-driven selection is supposed to have left detectable signatures on the genome of North African cattle which are often characterized by the fixation of genetic variants associated with traits under selection pressure and/or an outstanding genetic differentiation with other populations at particular loci. Here, we investigate the population genetic structure and we provide a first outline of potential selection signatures in North African cattle using single nucleotide polymorphism genotyping data. After comparing our data to African, European and indicine cattle populations, we identified 36 genomic regions using three extended haplotype homozygosity statistics and 92 outlier markers based on Bayescan test. The 13 outlier windows detected by at least two approaches, harboured genes (e.g. GH1, ACE, ASIC3, HSPH1, MVD, BCL2, HIGD2A, CBFA2T3) that may be involved in physiological adaptations required to cope with environmental stressors that are typical of the North African area such as infectious diseases, extended drought periods, scarce food supply, oxygen scarcity in the mountainous areas and high-intensity solar radiation. Our data also point to candidate genes involved in transcriptional regulation suggesting that regulatory elements had also a prominent role in North African cattle response to environmental constraints. Our study yields novel insights into the unique adaptive capacity in these endangered populations emphasizing the need for the use of whole genome sequence data to gain a better understanding of the underlying molecular mechanisms.

Taurine cattle were first introduced to Africa through Egypt from the Fertile Crescent ~ 6500 years BP 1 and dispersed into North Africa where they have undergone hybridization with local wild aurochs 2 . The geographic proximity of North Africa to Europe makes it a likely contact zone between the two continents. Several genetic studies reported an old presence of African cattle ancestry in the genomes of Iberian cattle 2,3 and a European ancestry in local Maghreb cattle [4][5][6] . Nomad pastoralism and tribal migrations prevented the division of North African cattle populations into clearly defined breed groups. Present-day indigenous cattle in Morocco, Algeria Tunisia and Libya belong to the Brown Atlas cattle. These are small-sized, sturdy, fairly compact animals with fine limbs, a short head and a straight to slightly concave profile. In these countries, Brown Atlas cattle populations, predominantly pasture-fed, are raised in a Mediterranean climate characterized by a winter rainfall and a hot dry summer during which live weight losses in adult cows can reach 20% 7 . In Egypt, indigenous cattle are medium sized, long-bodied animals, lean of musculature and lightly boned. They are raised either in desert or semi-desert regions characterized by a very arid Mediterranean climate and negligible rainfall. A number of ecotypes are recognised based on their geographical distribution. For instance, in Lower Egypt there are two local cattle populations, the Damietta is typically found in coastal sites and the Baladi or Baheri is widespread inland in the delta 8 . Overall, North African indigenous cattle are resistant to many of the diseases and parasites to which imported European cattle are susceptible 7 resulting from a local environment-driven selection that occurred over hundreds of years. Adaptation to local conditions is expected to leave distinct signatures in the www.nature.com/scientificreports/ genome known as a "selective sweeps" owing to a rapid increase in the frequency of the desirable alleles or in the frequency of neutral markers in linkage disequilibrium with the favorable alleles 9 . Studies on signatures of selection focusing exclusively on North African cattle have never been reported before. The emergence of high-throughput single nucleotide polymorphism (SNP) genotyping and whole genome sequencing facilities coupled with the development of new genomic methodologies have enabled the screening of a large part of the genome to detect signatures of selection in livestock and domestic populations [10][11][12][13][14] . All these studies have used comparison of genomic patterns of SNPs variability between local and exotic breeds to identify genomic regions and genes that have undergone selective sweeps.
The main goal of this study was to investigate population structure and candidate positive selection signatures in North African cattle using genotype data from the Illumina BovineSNP50 BeadChip with comparisons against four European breeds, three African and two indicine populations. We applied four genome scan approaches to identify genomic regions putatively under selection: the first three methods are extended haplotype homozygosity (EHH)-derived statistics (iHS, Rsb and XP-EHH) and are based on the decay of haplotype homozygosity as a function of recombination distance. The fourth approach is a Bayesian method based on the differentiation of allele frequencies among populations.

Results
Population structure analysis among all cattle populations. We used Principal Component Analysis (PCA) to contextualize the genetic variation of North African cattle populations (Fig. 1). The first two principal components accounted for 5.67% (PC1) and 3.74% (PC2) of the total genetic variation. The global www.nature.com/scientificreports/ organization of the genetic diversity of the populations of the study might be described as a triangle with apexes corresponding to North European breeds (Angus (ANG) and Holstein (HOL)), African taurines (NDA, ND1 and ND2) and indicine populations (NEL and GIR). PCA results show that the Tunisian Brune de l' Atlas (TUN-IND) and the Algerian populations (Guelmoise (GUE) and Cheurfa (CHE)) are closer to each other than to the Moroccan (Oulmes Zaer (OUL) and Tidili (TID)) and the Egyptian (Baladi (BAL)) populations. Furthermore, these results distinguished Biskra (BIS) and Chelifienne (CHF) from the other North African populations. The former was positioned near European breeds with several BIS individuals clustering along with Montbéliarde (MON) while CHF individuals showed a higher dispersion around their center of gravity (with several individuals positioned near MON) indicating a high genetic heterogeneity. Breed assignment to clusters using ADMIXTURE provided further insight into the genetic structure of North African populations. Figure 2 shows the results obtained for K values 2, 3, 5, 7, 10, 12 and 17. K = 10 showed the lowest cross-validation error ( Supplementary Fig. S1). At K = 2, European taurine breeds were separated from indicine and African cattle. The K = 3 model further separated African populations from indicine cattle. All North African populations except BAL carry two main European and African ancestries. In agreement with PCA results, BIS shows the largest amount of European ancestry with a minimum of 61.86% and a maximum of 88.5% while the Moroccan TID has the largest amount of African ancestry with a minimum of 55.67% and a maximum of 70.32%. For its part, BAL possesses a significant amount of indicine ancestry with a minimum of 16.41% and a maximum of 30.35%. At K = 5, the three European breeds (ANG, HOL and Jersey (JER)), formed three different clusters. All North African populations had on average 21.69% (with a minimum of 10.93% in BAL and a maximum of 29.42% in BIS) and 19.47% (with a minimum of 10.85% in BAL and a maximum of 46.37% in BIS) of JER and HOL ancestries, respectively. At K = 7, all North African populations except BIS and a few CHF individuals can be seen as distinct from the other breeds with a major "North African" component ranging, on average, from 48.8% for BAL and CHF to 79.5% for TID. It is worth noting that BIS displayed a substantial level of MON introgression (on average, 32.1%) while no African ancestry was detected within this breed (Fig. 2). At K = 10, BAL separated from the other North African populations while this happened for OUL when K was set to 17.  Supplementary Table S1. Most of North African populations showed low differentiation levels. The lowest F ST values are found between CHE and GUE (F ST ~ 0), CHE and TUNIND (F ST = 0.002) and between GUE and TUNIND (F ST = 0.003). Likewise, low genetic differentiation is observed between TID on one hand, GUE, CHE and TUNIND, on the other hand (F ST TID/ GUE = 0.016, F ST TID/CHE = 0.016 and F ST TID/TUNIND = 0.015) while a higher F ST is observed between these three breeds and BAL (0.042, 0.042 and 0.045 for BAL/CHE, BAL/GUE and BAL/TUNIND, respectively).

Discussion
The main purpose of the present study is to unravel signatures of positive selection in North African cattle. Because we used several breeds with diverse population structure, the main challenge in our study was to minimize the rate of false-positive signals that can arise, inter alia, owing to the confounding effects of population demographics 15 . Assuming that populations with similar structure have undergone similar evolutionary processes, in our selection signature detection analyses, we retained only North African populations showing a high degree of within population genetic homogeneity and a large portion of North African ancestry. In agreement with previous studies 6 our genome analyses are consistently and strongly in the direction of a substantial and recent contribution of European breeds to the genomes of BIS and CHF (Figs. 1, 2). Furthermore, in the admixture models in which K = 7, 10 and 12, the individuals sampled from these two breeds showed a high degree of within population genetic heterogeneity. Therefore, BIS and CHF were discarded from the subsequent selection signature analyses. Our results corroborate previous reports 16 suggesting that BAL resulted from a three-way admixture between breeds representative of European, African and indicine cattle. The presence of an indicine content within the genome of BAL is consistent with a wave of indicine introduction during the rinderpest epidemic of the nineteenth century 1,17 . Our results indicate that all North African populations share ancestry with Jersey cattle which supports previous whole genome sequencing analyses reporting a common distinct patriline of Jersey bulls with African cattle 18 . Overall, our findings indicate that modern North African cattle can be classified into 3 subgroups. The first one is the "Brune de l' Atlas" population which possesses two main African and European ancestries. This subgroup includes the Moroccan TID, the Algerian GUE and CHE and the Tunisian Brune de l' Atlas. The second subgroup consists of the Egyptian local cattle which possesses an additional large portion of indicine ancestry (at the expense of European ancestry). The third subgroup, represented by CHF and BIS, includes European-derived breeds. The phylogenetic network inferred by TreeMix corroborate these findings in that CHF and especially BIS are in clade with the European breeds while CHE, TID, TUNIND and GUE share the same branch and are much closer to African populations.
In this paper, we present the first genome-wide scan of putative selective sweeps in North African cattle by combining four different statistical methods based either on the decay of haplotype homozygosity as a function of recombination distance or on allele frequency differentiation among populations. In total, we highlight the presence of 36 different genomic regions putatively under selection using the first type of approaches (iHS, Rsb and XP-EHH) and 92 outlier SNPs using Bayescan. Consistently with previous observations 19 , we observe little overlap between results obtained from each of the two types of approaches. Given that Bayescan assumes that the gene frequencies under any neutrally structured population model can be approximated by a multinomial Dirichlet distribution 20 which would not be appropriate in a hierarchical population structure 21 (as is the case for our North African sample), the 92 identified SNPs should be considered cautiously. Instead, we believe that the three EHH-based methods, which inter alia, can detect a wider range of selection scenarios 22 , are more suitable to our study design. These statistics take advantage of the reduction in haplotype diversity in the neighbourhood of a beneficial mutation due to a "hitch-hiking" effect. They measure the extended haplotype homozygosity which is defined as the probability of identity by descent for two randomly chosen haplotypes carrying a core haplotype of interest in an interval around a given locus, given that they have the same allele at the locus 23 . Unlike Rsb and XP-EHH, the iHS test has low power in identifying fixed sweeps because it requires the ancestral allele to be still segregating in the population 24 . Here, we identified a higher number of outlier windows using Rsb and XP-EHH compared to the iHS approach which might suggest, at first glance, that most of the candidate regions identified here have undergone a positive selection resulting in the (near) fixation of the favoured alleles across the populations. However, we believe that the low number of candidate regions identified by the iHS test www.nature.com/scientificreports/ is actually due to the fact that this approach searches for loci where a given high-frequency haplotype is much longer relative to all other haplotypes, yet in a soft sweep several long haplotypes will be present at the adaptive locus and thus not one haplotype will typically be much longer than all others 25 . Our hypothesis assumes that the majority of sweeps detected here are soft which is likely to be the case. Soft sweeps were shown to be widespread and account for the vast majority of recent environmental adaptation in several species such as Humans 24 . A common constraint of selection signature detection methods is the detection of false positives. One efficient way to reduce their number is to retain as outliers, those genomic regions detected by distinct methods 26 . Among the 36 genomic regions identified by EHH-based methods, 10 were detected by two tests and one candidate region was identified by all three tests. In addition, two other regions (BTA07: 36,720,000-38,670,000 bp and BTA08: 88,100,000-90,020,000 bp) identified by the Rsb EUT/North AFT comparison included two outlier SNPs detected by Bayescan. We particularly focused on genes located within these 13 genomic regions. In agreement with previous findings 27,28 , we observed that the three candidate regions jointly identified by the Rsb and XP-EHH tests in the EUT/North African comparison were significantly enriched for genes involved in olfactory receptor activity (21 genes) which might reflect the fact that selection has been relaxed around these genes in European breeds which are often raised in abundant food supply conditions. Two genes (OR2W3 and OR2L13) coincided with CNVs previously reported in cattle (Supplementary Table S4). Olfactory receptor genes are duplicated within the bovine genome 27 and CNVs encompassing these genes were found to be associated with population-specific differences in smell in most mammalian species 29 . Many of our candidate regions harboured genes implicated in the adaptive immune response against microbial pathogens. For instance, the clearest sweep signal in the EUT/North AFT comparison detected on BTA07 (between positions: 41.06 and 43.62 Mb) with 13 SNPs (out of 32) exceeding the significance threshold, harboured 58 known genes amongst which six (AZU1, ELANE, GZMM, PRSS57, PRTN3, CFD) belong to the S1A family of peptidases, a superfamily of proteolytic enzymes with a wide variety of biological functions in parasite infection 30 . Similarly, another relevant selection signature on the BTA19 jointly detected by iHS, Rsb and XP-EHH EUT/North African harboured several genes which are involved in immune response: CD79B, MILR1, PECAM1, MAP3K3 and TCAM1. The last two genes mediate NF-kappa-B activity which show evidence of positive selection in the African N'Dama cattle to alter in functions to effectively regulate the infection of cattle trypanosome 31 . Consistently, we also observed that outlier windows from AFT/North African and IND/ North African comparisons included many genes associated with immune response and host defence such as TNFRSF11A, IRF8, MYO1G and several GTPases of immunity-associated protein (GIMAP) genes (GIMAP4, GIMAP5 and GIMAP7). Several of these genes (GIMAP4, GIMAP5, GIMAP7, IRF8) coincided with CNVs reported in cattle (Supplementary Table S4). A major phenotype of North African cattle populations is their resistance to parasitic diseases such as theileriosis, babesiosis and anaplasmosis 32 which are highly prevalent in North Africa 33 . We suggest that the aforementioned genes have been under evolutionary pressure in North African cattle and that some of them may have experienced enhanced fixation of duplicates resulting from selection for increased dosage to effectively regulate the innate and acquired immune response to parasitic diseases. A previous study 34 conducted on Brazilian Bos indicus cattle, similarly reported that CNVs are important modulators of immune gene expression. Our results have also revealed a series of other genes involved in the regulation of blood pressure and heart contraction (ACE, ACE3, COX4I1, NOS3, CXADR), blood vessel development and morphogenesis (CCM2, FOXC2, FOXF1, MAP3K3). These genes are expected to be involved in adaptation to extreme temperatures prevailing in several Northern African areas and/or to chronic hypoxia in the Atlas mountain ranges where the altitude varies between 900 and 4000 m 7 . Our hypothesis is consistent with the presence of three hypoxia-related genes (BCL2, HIGD2A and CBFA2T3) and three other genes involved in response to heat (ASIC3, HSPH1 and MVD) in the relevant candidate regions (Table 1). It is also interesting to note that the strong selection signal on BTA19 harboured a well-known gene, GH1, linked to response to nutrient levels (GO: 0031667), positive regulation of lactation (GO:1903489) and triglyceride biosynthetic process (GO:0010867) and was previously reported as being a candidate gene for dairy production traits in Braunvieh cattle 15 . Importantly, it has been suggested that elevated GH1 gene expression may constitute an adaptive response to the effects of scarce food supply in a sample of 163 human individuals from Benin 35 . We therefore suggest that this gene is particularly under positive selection across North African cattle populations as a consequence of important seasonal fluctuations in food availability characterizing the whole region.
Six out of the 13 relevant candidate regions identified in this study, harboured fewer than 15 known protein coding genes ( Table 1). Many of these genes have also been reported in cattle and other species. For instance, the outlier window on BTA01 (at position: 17,740,000-19,640,000 bp), contained 6 protein coding genes including TMPRSS15 and CHODL, two genes that were reported to be under selection in the Iraqi indigenous cattle 13 . Similarly, the candidate region on the BTA24 (at position: 59,660,000-61,790,000 bp), harboured RNF152 gene which positively regulates Toll-like receptors (TLRs) which are important pattern recognition receptors that are critical for the defence against invading pathogens 36 . RNF152 gene was reported to be involved in local adaptations in the Ainu, a hunter-gatherer population of northern Japan 37 . Another relevant candidate region on BTA21 (at position: 14,830,000-16,650,000 bp) harboured four protein coding genes: SLCO3A1, SV2B, AKAP13 and KLHL25. The latter two genes were shown to be under positive selection in Creole cattle breeds 38 while SLCO3A1 is associated with marbling score in the Montana Tropical Composite beef cattle 39 and mediates inflammatory processes in intestinal epithelial cells through NF-kappa-B transcription activation in humans 40 . SV2B gene is among major genes enriched for the extracellular matrix (ECM) around the hair follicle in Changthangi goats 41 . ECM is considered important for regulating the structure, metabolism and signaling of dermal papilla cells which play key roles in hair follicle morphogenesis and regeneration 42 . Another candidate region on the BTA22 (at position: 4,790,000-6,620,000 bp) identified by the XP-EHH IND versus North AFT test harboured four genes (GADL1, TGFBR2, STT3B and OSBPL10) and among these, GADL1 gene is one of the genes involved in adaptive evolution of Anolis carolinensis introduced into the Ogasawara archipelago 43  www.nature.com/scientificreports/ decreased anxiety, increased levels of oxidative stress markers, alterations in energy and lipid metabolism, and age-related changes 44 . STT3B is a catalytic subunit of hetero oligomeric oligosaccharyltransferase (OST), which is important for asparagine linked glycosylation. In mammals and plants, OSTs exhibit distinct levels of enzymatic efficiency or different responses to stressors 45 . OSBPL10 gene confers African-ancestry protection against dengue haemorrhagic fever in admixed Cubans 46 .
A further result is that the 13 outlier windows identified by at least two approaches included myriad of genes involved in transcriptional regulation (AEBP1, ARID3A, BANP, CBFA2T3, DDX5, FTSJ3, GLI3, MIER2,  POLR2E, POLRMT, FOXC2, FOXF1, FOXL1, SMARCD2, SMARCD3, TNFRSF11A, BPTF, CDK5, …) as well as many non-coding RNAs including 9 small nucleolar RNAs (snoRNAs), 12 microRNAs (miRNAs), 10 small nuclear RNAs (snRNA) and 13 long noncoding RNAs (lncRNAs). In addition, many of the aforementioned genes (BANP, CBFA2T3, GLI3, POLR2E, POLRMT, FOXC2, FOXF1, FOXL1) co-localize with known cattle CNVs. It is worth noting that CNVs encompassing a gene encoding a transcription factor has a greater phenotypic impact because it can affect both the coding sequence of the gene itself as well as the expression of downstream targets of that gene. From a selective standpoint, these findings suggest that natural selection has shaped North African cattle genome not only through variation in coding sequence but also through extensive regulation of gene expression occurring both at the transcriptional and post-transcriptional level. Lending further support to this hypothesis, the relevant candidate region on BTA24 (at position: 59,750,000-61,740,000 bp) harbours a single gene, CELF4, coding for an RNA-binding protein mainly expressed in central nervous system that regulates the expression of many genes co-transcriptionally or post-transcriptionally via interactions with mRNA 47 . Celf4deficient mice have additional neurological abnormalities including hyperactivity and hyperphagia-associated obesity 48 . Similarly, the most relevant selection signal in the AFT/North AFT comparison (BTA06 at position: 46,780,000-50,050,000 bp) harboured one protein coding gene (PCDH7) which coincides with a known CNV (Supplementary Table S4), one 5S ribosomal RNA (5S rRNA) and three non-coding RNA genes: SNORA70, Y_RNA and U6 (Table 1). PCDH7 is one of the key genes involved in oncogenesis and/or differentiation of the cancer stem cells through a change in its histone methylation status 49 . Likewise, 5S rRNA genes are highly methylated in Arabidopsis thaliana and their expression is under epigenetic control 50,51 .
During the process of fixation of adaptive variants, linked neutral markers are dragged along with the selected site; thus reducing the levels of genetic diversity in the region, while simultaneously new mutations accumulate in the region. The initial frequency of these mutations is low, so that a DNA sequence harbouring a positively selected variant will also harbour an excess of rare derived alleles. Bearing this in mind, we expect that many other sweeps are not detected by our genome scan owing to ascertainment schemes used to discover the BovineSNP50 BeadChip. Clearly, shedding light on additional selective sweeps in North African cattle would require the use of whole genome sequence data and the inclusion of all variants in genetic analyses.
The present study highlighted, for the first time, the presence of putative selection signatures in six local North African cattle populations. Information about the location of these regions can now be used as a starting point to identify causal genetic variants that control some environmental adaptation traits in local breeds which can be utilized in the genetic improvement of commonly used commercial breeds world-wide. Our results are unique in indicating that selection have shaped North African cattle genome through extensive regulation of gene expression whereby the individuals get adapted to short as well as long-term environmental changes. Understanding the functional consequences of such adaptive elements remains a challenge to overcome.

Methods
Data merging and SNP filtering. We combined Illumina BovineSNP50 BeadChip genotypes of 57 Brune de l' Atlas individuals (TUNIND) sampled from our previously published data 4,52 with data already available for 221 animals belonging to seven North African populations (BAL, BIS, CHE, CHF, GUE, TID and OUL) obtained from Flori et al. 16 and Gautier et al. 53 . We also included genotyping data belonging to 9 other populations, representatives of European taurines (EUT) (four breeds: ANG, HOL, JER and MON), African taurines (AFT) (three N'Dama populations: ND1, ND2 and NDA) and indicine (two populations: GIR and NEL) from Matukumalli et al. 54 . All genotypes were recovered from the web-interfaced genetic Diversity Exploration (WIDDE) database 55 . We performed a relatedness test between individuals within each population using PLINK 56 . The software calculates a variable called PI-HAT reflecting extended haplotypes shared between distantly related individuals. For European, indicine and African breeds, we removed closely related individuals if the PI-HAT value was greater than 0.25 which is a value roughly corresponding to relationships closer than grandsire-granddaughter. For the North African populations, in which natural service is commonly used rather than artificial insemination and are thus generally less inbred, we used a more stringent threshold and excluded one individual from each pair of individuals with a PI-HAT value > 0.1. In total, after relatedness filtering, 468 individuals including 204 North African animals, were available for the different analyses (Supplementary Table S10). We also applied a series of quality control procedures to the genotype data. First, we excluded rare SNPs with low minor allele frequencies (MAF) < 0.05. Then, the whole genotype dataset was subjected to linkage disequilibrium (LD) pruning using the default parameters of PLINK (SNP window size:50, step 5 SNPs, r 2 : 0.5). In total, 38,464 SNPs spread over all autosomal chromosomes were finally considered for population structure analyses.
Population structure and genetic relationship analyses. Population structure was inferred by PCA for African, European, indicine and North African populations using the adegenet R package 57 . Unsupervised hierarchical clustering was carried out for all populations using ADMIXTURE 1.23 software 58 . We ran ADMIX-TURE with cross-validation for values of K from 2 through 17 (the number of populations) to identify the best value of K clusters. DISTRUCT software 59 was then used to graphically display ancestry within each individual. The pairwise fixation index (F ST ) between populations was estimated using Genepop 4.6 software 60  www.nature.com/scientificreports/ of population splits and mixtures were inferred using TreeMix 61 . First, we built a maximum likelihood tree of the 17 populations of the study with no migration events allowed and setting GIR as outgroup. Then, we built a phylogenetic tree of these populations and started adding migration events (modeled as edges) sequentially to the phylogenetic model. The migration edges were added until 99.93% of the variance in ancestry between populations was explained by the model. The residuals from the fit of the model to the data were visualized using the R script implemented in TreeMix.

Identification of selection signatures.
To perform selection signature detection, we selected the individuals that are most representative of the ancestral North African cattle. This was done based on the results of model-based clustering results. We used the population differentiation based analysis implemented in BayeScan (F ST ) 62 and three extended haplotype homozygosity (EHH)-based tests (iHS, Rsb and XP-EHH) to detect signatures of selection within North African cattle. Bayescan, Rsb and XP-EHH analyses were performed for each of the three pairwise comparisons: North African cattle versus AFT, North African cattle versus EUT and North African cattle versus IND. Bayescan uses a reversible-jump Markov Chain Monte Carlo to separate locusspecific effects of selection from population-specific effects of demography. Outliers are those loci that require the locus-specific component to explain observed genetic diversity. For the Markov chain Monte Carlo (MCMC) algorithm we used 20 pilot runs of 5,000 iterations, a burn-in of 50,000 iterations, a thinning interval of 10 (5,000 iterations were used for the estimation of posterior odds) with a resulting total number of 100,000 iterations.
To control the number of false positives, significant SNPs were defined by applying a q-value threshold of 0.05. Haplotype extended patterns were investigated using three metrics implemented in rehh package 63 : the iHS within-population statistic 64 and two between-population methods: Rsb 65 and XP-EHH 66 . In iHS computation, the information on the ancestral and derived allele state is needed for each SNP because this statistic is based on the ratio of the EHH associated to each allele. In our analysis, the ancestral allele was inferred as the most common allele within 3 out-group species including yak, buffalo and sheep. iHS scores for each SNP were transformed into two-sided p values: piHS = − log10[1-2|Φ(iHS)-0.5|]. As a prerequisite to the Rsb and XP-EHH computation, haplotypes were reconstructed from the genotyped SNPs using fastPHASE 1.4 67 . The following options were used for each chromosome: -T10 -Ku60 -Kl10 -Ki10. Considering that Rsb and XP-EHH values are normally distributed, a Z-test was applied to identify significant SNPs under selection. Two-sided p value s were derived as pRsb = − log10[1-2|Φ(Rsb)-0.5|] and pXP-EHH = − log10[1-2|Φ(XP-EHH)-0.5|] where Φ (x) represents the Gaussian cumulative distribution function. In EHH-based tests, the maximum allowed gap between two SNPs was set to 500 Kb. We used 1-Mb sliding windows that partially overlapped 10 kb with adjacent windows to perform selection signature detection. A window is classified as putatively under selection when it contains at least 3, 4 and 4 markers exceeding the significance threshold of − log10 (p value) = 3 for iHS, Rsb and XP-EHH tests, respectively. Finally, we checked the overlap of the candidate genomic regions detected with at least two EHH-based approaches with the previously identified bovine Quantitative Trait Loci (QTL) available in the cattle QTL database (https ://www.anima lgeno me.org/cgi-bin/QTLdb /BT/index ). The overlaps were checked using QTL coordinates according to the Bos taurus genome assembly ARS-UCD1.2.
Gene identification and functional enrichment analysis. Candidate genome region intervals detected by at least two EHH-based methods (iHS, Rsb, XP-EHH) were interrogated for genes annotated to the Bos taurus genome assembly ARS-UCD1.2 using BioMart tool of Ensembl (https ://www.ensem bl.org/bioma rt/ martv iew/c8fe3 a6996 1a408 8a55b 7a249 db7e2 fa). Cattle structural variants which overlapped the genomic coordinates (in bp) of these relevant candidate selective sweep regions were retrieved using the same database. We have only considered structural variants of less than 8 Mb which corresponds to the maximum size that can be identified, from whole genome sequence data, by the pindel software (https ://gmt.genom e.wustl .edu/packa ges/ pinde l/user-manua l.html). We used the online tool, Database for Annotation, Visualization and Integrated Discovery (DAVID) software version 6.8 (https ://david .ncifc rf.gov/) for functional enrichment analysis of the genes retrieved from BioMart. GO enrichment analysis included two aspects: Biological Process and Molecular Function. For the GO functional groups and InterPro functional terms returned from DAVID functional analysis, we considered an adjusted Benjamini-corrected p value threshold of ≤ 0.05.