Genetic diversity and signatures of selection in four indigenous horse breeds of Iran

Indigenous Iranian horse breeds were evolutionarily affected by natural and artificial selection in distinct phylogeographic clades, which shaped their genomes in several unique ways. The aims of this study were to evaluate the genetic diversity and genomewide selection signatures in four indigenous Iranian horse breeds. We evaluated 169 horses from Caspian (n = 21), Turkmen (n = 29), Kurdish (n = 67), and Persian Arabian (n = 52) populations, using genomewide genotyping data. The contemporary effective population sizes were 59, 98, 102, and 113 for Turkmen, Caspian, Persian Arabian, and Kurdish breeds, respectively. By analysis of the population genetic structure, we classified the north breeds (Caspian and Turkmen) and west/southwest breeds (Persian Arabian and Kurdish) into two phylogeographic clades reflecting their geographic origin. Using the de-correlated composite of multiple selection signal statistics based on pairwise comparisons, we detected a different number of significant SNPs under putative selection from 13 to 28 for the six pairwise comparisons (FDR < 0.05). The identified SNPs under putative selection coincided with genes previously associated with known QTLs for morphological, adaptation, and fitness traits. Our results showed HMGA2 and LLPH as strong candidate genes for height variation between Caspian horses with a small size and the other studied breeds with a medium size. Using the results of studies on human height retrieved from the GWAS catalog, we suggested 38 new putative candidate genes under selection. These results provide a genomewide map of selection signatures in the studied breeds, which represent valuable information for formulating genetic conservation and improved breeding strategies for the breeds.


INTRODUCTION
Different agro-ecological conditions across Iran have resulted in the development of distinct indigenous horse breeds. Iran has five officially registered horse breeds, including the Persian Arabian, Caspian, Turkmen (Akhal-Teke), Dareshuri, and Kurdish, with longstanding husbandry, cultural, and historical importance (Fotovati 2000). Unfortunately, these breeds are classified as endangered populations (Sadeghi et al. 2019) due to their small population size and uncontrolled admixture with exotic breeds. Due to the lack of active and strong equine breed associations in Iran, there is no clear strategy for the breeding of these populations. Breed purity is the most important factor for the pricing of Caspian and Kurdish horses and, as a result, horse owners mate their mares with stallions that have a valid pedigree. In Kurdish and Turkmen horses, due to the availability of breed and beauty competitions, in addition to their pedigree information, stallions are selected and mated based on their own and their relatives' performance in these competitions. Since genetic diversity is crucial for a population to adapt to changing environments, an assessment of genetic diversity and the selection pressures behind it will help to choose the most appropriate solution in this situation.
Consequently, breeding programs must be designed to prevent further loss of genetic diversity.
The Arabian horse is distributed worldwide, with a population size of >1 million (World Arabian Horse Organization, WAHO). However, this breed is divided into relatively small breeding populations in many countries in the Middle East, including Egypt, Saudi Arabia, Syria, and Iran, each with very different phenotypic attributes (Forbis 1976). Based on the current evidence from Y-chromosome analysis, three discrete Y-chromosomal haplotypes specific to the Arabian horses were detected (Remer et al. 2022). The Turkmen horse is thought to be the oldest surviving horse breed in the world, divided into several populations with unique genetic resources, which mostly differ in type, conformation, and usage (Jiskrová et al. 2016). The Caspian horse is a small ancient breed, with its natural habitat in the north of Iran. Although it was suggested that the Turkmen and Caspian horses might be ancestral to all forms of the oriental horse (Firouz 1998), most modern male lineages were derived from two major subclades, including Turkmen and Arabian lineages based on Y-chromosome analysis (Wallner et al. 2017). The study by Remer et al. (2022) showed a clear distinct Y haplotype phylogeny between Turkmen and Arabian horse breeds. The Kurdish horse originates from the west of Iran, an area characterized by mountainous topography and a moderately cold climate. This environment created a unique horse population resistant to harsh environmental conditions. The Iranian Kurdish horse has a small population size of only about 7000 horses. Therefore, an optimized selection strategy and mating program has been suggested to maintain genetic diversity (Nazari et al. 2022).
Despite several studies on different native Iranian horse breeds that assessed the genetic diversity, parentage verification, and genetic structure of the populations using microsatellite or singlenucleotide polymorphism (SNP) (Amjadi et al. 2021;Nazari et al. 2022;Sadeghi et al. 2019;Salek Ardestani et al. 2022;Seyedabadi et al. 2006;Shahsavarani and Rahimi-Mianji 2010), the majority of the studies were performed on a single breed (Ala- Amjadi et al. 2017;Gharahveysi and Irani 2011;Nazari et al. 2022;Rahimi-Mianji et al. 2015;Seyedabadi et al. 2006). However, Moridi et al. (2013) studied the mitochondrial DNA (mtDNA) diversity and origin of five Iranian native horses, including Turkmen, Kurdish, Caspian, Iranian Arabian, and Sistani breeds. In 2019, Sadeghi et al. (2019) performed a study using five Iranian horse breeds, including Turkmen, Kurdish, Caspian, Iranian Arabian, and Dareshuri, where they focused on the genetic diversity and genomewide selection signatures in the Persian Arabian. Subsequently, Yousefi-Mashouf et al. (2021) compared the Kurdish horse population with the Persian Arabian and American Thoroughbred populations. Their results showed a significant population structure pattern between Kurdish and Persian Arabian samples. To date, there have been no genomewide SNP-based studies of genetic diversity and on detecting signatures of selection in multiple Iranian native breeds.
Different statistical tests may not generate consistent results for detecting signatures of selection due to variation in their power as a consequence of demographic history, type of selection, genetic architecture, and experimental design, as well as variation in sensitivity to sampling design (Lotterhos and Whitlock 2015;Schlamp et al. 2016;Vatsiou et al. 2016). ***Combining the results from multiple tests generally perform better with more distinguishable footprints of positive selection, and importantly in a closer proximity to the real selected locus (Lotterhos et al. 2017b). Many strategies have been proposed to overcome these issues, such as methods based on combining P-values of different test statistics (composite measures of selection) (Lotterhos and Whitlock 2015;Randhawa et al. 2014). To improve statistical power and resolution, we combined multiple statistics of signatures of selection including F ST (Wright 1949), FLK (the extension of the Lewontin and Krakauer test) (Bonhomme et al. 2010) and xp-EHH (cross-population extended haplotype homozygosity) (Sabeti et al. 2007) within a single de-correlated composite of multiple selection signals (DCMS) framework (Ma et al. 2015). This calculation combines P-values while considering the correlation between the various statistics.
This study evaluates four Iranian horse breeds: Turkmen, Caspian, Kurdish, and Persian Arabian. We used the genomewide SNP information in a comprehensive genomewide analysis of the genetic diversity of Iranian horses to investigate the population structure and their genetic differences. Additionally, we applied a genomewide analysis to detect breed-specific genomic regions that display signals characteristic of selection. The selected genomic regions may contain variants important for the unique physiological traits and adaptations characteristics of each Iranian horse breed.

MATERIALS AND METHODS Samples and populations
The data represent four different data sets. In this study, we used a data set collected by the Animal Science Research Institute of IRAN (ASRI), with the main focus on establishing genomic information for the Kurdish breed (n = 81). Horses in the ASRI data set were selected based on pedigree information, without crossbreeding in their pedigree, and those passed microsatellite parentage testing. These samples were collected from four Iranian provinces (Kermanshah, Kurdistan, Isfahan, and Kerman). One Persian Arabian (collected from Tehran) and two Turkmen (collected from Isfahan and Tehran) samples were present in the ASRI data set used in this study. A total of 86 samples from three breeds, including Turkmen (n = 34), Caspian (n = 22), and Kurdish (n = 30), were collected by the National Animal Breeding Center (ABC) from six provinces (Kermanshah, Kurdistan, Isfahan, Golestan, Mazandaran, and Tehran). This data set was collected between 2014 and 2015. For each breed, horses were sampled based on their pedigree. To extend and include Persian Arabian horses in our samples, we merged our data set with the data from two studies: 96 samples from Sadeghi et al. (2019), including Persian Arabian, Turkmen, Caspian, and Kurdish breeds, and 9 Persian Arabian samples from Cosgrove et al. (2020).

Genotyping and quality control
Genomic DNA from the ASRI and ABC samples was extracted from more than 50 hair roots collected from the horse's mane of each sample using the DNeasy Blood and Tissue kit (Qiagen, Germany). Then, samples were genotyped at Neogen GeneSeek Inc. (Lincoln, NE, USA) using the GGP Equine 70K SNP BeadChip array (Neogen GeneSeek, Lincoln, NE). Samples from the two other data sets were genotyped with the 670K Equine SNP chip as described previously (Cosgrove et al. 2020;Sadeghi et al. 2019). The annotation of SNP chromosomal positions was initially reported as EquCab2.0, and then converted to the most recently updated map using the lift genome annotations tool (https://genome.ucsc.edu/cgi-bin/ hgLiftOver) based on the EquCab3.0 assembly (Kalbfleisch et al. 2018).
The genotype data from different data sets were merged, keeping the loci that overlapped between the chips using PLINK V1.9 (Purcell et al. 2007). As our analysis focused on autosomal variants, we excluded the SNPs on sex chromosomes. After merging the data, we had 275 samples and 40,813 SNPs for further quality control (QC) (Fig. 1). The QC was performed using PLINK V1.9 (Purcell et al. 2007). First, SNPs and samples were filtered based on a less than 90% genotyping rate. Second, SNPs with a minor allele frequency (MAF) lower than 1% or those that departed from Hardy-Weinberg proportions at P < 10 -6 were discarded. Third, as closely related individuals within each breed of the horse samples may provide biased sets for use in some of the analyses, we filtered individuals based on pairwise identity-by-descent (IBD) less than 0.25. The --genome option in PLINK was implemented to calculate the IBD values. The number of removed SNPs and samples after each step of QC are shown in Fig. 1. Before removing related individuals, Beagle V5.2 software was used to impute the sporadic missing genotypes in each population separately (Browning and Browning 2007). As well described by Pook et al. (2020), the effective population size (N e ) can have a major effect on the error rate for imputation in Beagle, wherein the Markov chain uses the genetic data set to initialize a haplotype cluster and then initializes the Hidden Markov Model by identifying the most likely path through the haplotype cluster. Based on the default setting, this works well for imputation in outbred human populations, but the algorithm can be adjusted to the specific genetic structure of the respective data set by changing the N e parameter. Pedigrees in Iranian horse breeds have much less genetic diversity than in human populations, which results in a smaller N e . To deal with this, we predicted the N e for each breed using NeEstimator V2.1 (Do et al. 2014); the estimated N e was used as the input for Beagle (see the section "Estimation of effective population size using genomewide SNP data").
As LD-based pruning reduces the effect of ascertainment biases on diversity estimates (Malomane et al. 2018), a subset of SNPs was selected by linkage disequilibrium (LD)-based SNP pruning using PLINK for genetic diversity and population structure analysis. For principal component analysis (PCA), we applied the indep-pairwise command with an SNP window size of 50, 5 SNPs shifted per step, and the squared correlation between SNPs (r 2 ) ≤ 0.5. For admixture and relatedness analysis, SNPs were pruned using the same sliding window and SNPs shifted per step, while the r 2 threshold was 0.3. After LD-based SNP pruning, 28,387 and 22,523 SNPs remained for PCA, and admixture and relatedness analysis, respectively.
Genetic distances, pairwise F ST , and absolute allele frequency difference (AFD) The Cavalli-Sforza and Edwards Chord distance (Takezaki and Nei 1996) was calculated from the SNP information using the R package hierfstat S.F. Mousavi et al. (Goudet 2005). We applied the R package hierfstat to calculate the pairwise F ST between the four main Iranian horse breeds (Goudet 2005). A bootstrap confidence interval was calculated for each pairwise F ST comparison. As calculations of F ST can be impacted by asymmetry in sample sizes, we calculated the absolute allele frequency difference (AFD) estimates to display the magnitude of differentiation (Berner 2019). The AFD at each locus was calculated as where n is the number of alleles observed at the SNP, and p i1 and p i2 are the frequencies of allele i in populations 1 and 2, respectively.
Estimation of effective population size using genomewide SNP data We estimated both historical and contemporary N e for all breeds using all 40,120 quality-controlled SNPs. SNeP software was implemented to estimate the historical N e based on LD information (Barbato et al. 2015). We used the default options with a maximum distance of 4000 kbps. The contemporary N e was calculated using NeEstimator V2.1 (Do et al. 2014). We used a single-sample estimator based on LD information that implements a bias correction when the sample size is less than the true effective size (Waples and Do 2008). A random mating population was assumed for the analysis. MAFs of 0.1, 0.05, 0.02, and 0 for all SNPs were included in the analysis to screen out the effect of rare alleles on N e estimation. The jackknife method was used to calculate the confidence intervals, as this method can perform better than parametric confidence intervals for the LD method (Waples and Do 2008). To omit comparisons of loci on the same chromosome, LD was calculated among pairs of SNPs located on different chromosomes.

Genetic diversity and population structure
First, we performed a PCA on the LD pruned subset of SNPs using the "prcomp" function (with scale = TRUE) of the R statistical software (R Core Team 2013). Then, we used the first three PCs to identify and remove outlier samples, as possibly there are pedigree errors, especially in the past if a microsatellite test was not used for a pedigree check, leading to unreported crossbred individuals. The local outlier factor (LOF) algorithm was implemented to score each data point in a multidimensional process. LOF is an unsupervised anomaly detection method that computes the local density deviation of a given data point when only a restricted neighborhood of each object is taken into account (Breunig et al. 2000). We assumed that the outliers have a substantially lower density than their neighbors, which results in a gap from the rest of the data. All detected outliers were excluded in subsequent analyses. After removing outliers, PCA was conducted again on an LD pruned subset of SNPs without outliers to investigate the relationships among populations.
We also evaluated the relationship among the populations using the R package SNPRelate (Zheng et al. 2012). First, we calculated the individual dissimilarities for each pair of individuals using the "snpgdsDiss" function. Second, a hierarchical cluster analysis on this dissimilarity matrix was performed using the "snpgdsHCluster" function. Finally, subgroups of individuals were determined via a specified dendrogram from this hierarchical cluster analysis using the "snpgdsCutTree" function with 50,000 permutations.
We applied the software ADMIXTURE in order to infer breed origins and quantify the admixture of the populations (Alexander et al. 2009). For each individual, ancestry proportions were estimated with a priori-defined ancestry components (K) as a presumption of the number of ancestral populations (Alexander and Lange 2011). We implemented a 5-fold crossvalidation for each K (ranged from 1 to 7) to estimate the standard error values for comparison, as the lowest one is the most likely number of ancestral populations.
LD between all SNPs was calculated with the --r2 command in PLINK and mean LD was expressed as a function of distance to determine the diminishing of LD with increasing physical distance between SNPs.

Runs of homozygosity (ROH)
ROH, the indicator of genomic autozygosity due to a long stretch of homozygous SNPs that may have been inherited from a recent common ancestor, were studied to increase our understanding of the genetic diversity status of the four native horse populations. ROH segments along the genome were calculated based on the sliding window method using the R package detectRUNS (Biscarini et al. 2018). To estimate the ROH in four Iranian horse breeds, we used all common 40,120 quality-controlled SNPs in each breed separately. The following options were used: the sliding window size (windowSize) = 15, minimum number of homozygous SNPs in a run (minSNP) = 20, the threshold of overlapping windows (threshold) = 0.05, minimum number of SNP per kbp (minDensity) = 1/ 168, maximum distance between two SNPs (maxGap) = 10 6 bps, and the minimum length of a homozygous run (minLengthBps) = 500 kbps. The detected ROHs were divided into five categories, including 0 to < 4 Mb, 4 to < 8 Mb, 8 to < 16 Mb, 16 to < 32 Mb, and > 32 Mb. For each of the ROH length categories, the number of runs in each breed was calculated by averaging the sum of all ROHs per animal within the breeds. To identify the position of a ROH peak, we used a threshold of 0.7, which indicates the ROH shared in more than 70% of the individuals within each population. The genomic inbreeding coefficient using ROH information (F ROH ) was calculated as follows: where L ROHj is the length of the jth (j = 1,…, p) ROH detected in an individual, and L genome is the total length of the genome that was used.

Selection sweep
We performed six pairwise comparisons, including (i) Turkmen vs. Caspian, (ii) Turkmen vs. Kurdish, (iii) Turkmen vs. Persian Arabian, (iv) Caspian vs. Kurdish, (v) Caspian vs. Persian Arabian, and (vi) Persian Arabian vs. Kurdish, to determine the genetic regions under selection using F ST (Wright 1949), FLK (Bonhomme et al. 2010), and xp-EHH (Sabeti et al. 2007) analyses. The F ST statistic was estimated using the program VCFtools v0.1.16 (Danecek et al. 2011). For each comparison, Z transformation of the mean of F ST values (Z(F ST )) was computed for all SNPs using the scale function in R software.
FLK is an extension of the original LK statistic (Lewontin and Krakauer 1973) that is based on the variance of F ST on comparing the observed and expected variances of F ST across loci. FLK accounts for complex demographic structures and differences in the effective population size using a phylogenetic estimation of the population's kinship (F) matrix to consider historical branching and heterogeneity of the genetic drift (Bonhomme et al. 2010). The FLK's P values were computed using the hapFLK software (Fariello et al. 2013). In this analysis, the kinship matrix was calculated from the Reynolds' genetic distances between populations (Reynolds et al. 1983) and a phylogenetic tree was fitted from these distances using the neighbor-joining algorithm. For each comparison, a negative log P-value was used.
An LD-based method, EHH, proposed by Sabeti et al. (2007) is useful in the detection of long homozygous regions that are under partial or soft selective sweep within a population. This test can be extended for implementation in the cross-population aspect, xp-EHH, which compares populations regarding their haplotypes (Sabeti et al. 2007). The xp-EHH scores were calculated using the REHH package (Gautier and Vitalis 2012) in R to determine selected mutations with a higher frequency than expected according to their haplotype length.

De-correlated composite of multiple signals (DCMS)
Combining various test statistics of the selection signature improves the signal-to-noise ratio and increases the resolution to identify selected genomic regions (Lotterhos et al. 2017a;Ma et al. 2015). We combined three genomewide selection signature statistics into a single DCMS value (Ma et al. 2015). DCMS accounts for the correlation between the different statistics while combining P-values produced by several statistics for each locus into a single DCMS measurement.
We calculated the genomewide DCMS for each pairwise comparison by combining the aforementioned statistics (F ST , FLK, and xp-EHH) for each locus as described in Yurchenko et al. (2018). The genomewide P-values were converted to fractional ranks for each statistic using the stat_to_pvalue function represented in the R package MINOTAUR (Verity et al. 2017) with the right-tailed test for all the three methods. The covariance matrix between the different statistics was estimated using the CovNAMcd function from the rrcovNA package in R (Todorov et al. 2011) with alpha = 0.75 and 30,000 randomly sampled SNPs to calculate an n × n covariance matrix. Then, this calculated covariance matrix was used as an input for the DCMS function in the MINOTAUR R package to calculate the DCMS statistics. The output of the DCMS function, i.e. the DCMS statistics, was fitted to the normal distribution using the robust linear model method implemented in the rlm R function (model = rlm (dcms~1), where the dcms is a vector of the raw DCMS values) of the MASS package (Venables and Ripley 2013). The rlm R function produced a mean and standard deviation that were used as inputs in the pnorm R function with a lower.tail attribute equal to FALSE. Finally, the DCMS P values were transformed into the corresponding q-values based on the Benjamini and Hochberg procedure (Benjamini and Hochberg 1995) to control the multiple testing false discovery rate (FDR < 0.05). This transformation was conducted using the p.adjust R function.

Identification of candidate genes, QTLs, and functional analysis
Equine gene annotations and gene ontology (GO) information from the horse genome assembly build Equcab3.0 were downloaded from Biomart in November 2021 (Kasprzyk 2011). To identify the putative candidate genes under selection, we extracted genes including SNPs with a qvalue < 0.05 in the coding region (both exon and intron) or including ± 40 kbps at both sides to control the regulatory regions. We selected 40 kbps extension based on the LD decay result from all breeds, where the average r 2 was >0.2 (see the section "Population genetic structure and linkage disequilibrium"). The gene ontology (GO) pathway enrichment analysis was performed by the enricher function from the clusterProfiler R package (Wu et al. 2021). A list of significant quantitative trait loci (QTLs) for the horse database (horse QTLdb) was retrieved from animalQTLdb (Hu et al. 2007) (http://www.animalgenome.org/QTLdb, Release 46, December 27, 2021), which includes 2605 QTL from 104 publications on 64 different traits. To detect the putative QTL under selection, we considered the QTLs that were located in genomic regions, including consecutive SNPs with a q-value lower than 0.05 or extended up to ± 40 kbps at both sides. We extracted the associated recorded genes for height in humans accessible on the genomewide association study (GWAS) catalog database (https:// www.ebi.ac.uk/gwas) (Buniello et al. 2019) to identify new possible candidate genes for height variation in Iranian horse populations.

RESULTS
A total of 275 genotyped horse samples using two types of SNP assays from four data sets were included in this study (Supplemental file, Table S1). After QC, 4 and 83 samples were removed from the data due to the genotype rate and high relatedness, respectively. The four main Iranian horse breeds under study have their own characteristics (see Supplementary file, Table S2) and were developed in different environmental conditions ( Fig. 2 and Supplementary file, Table S3). To quantify and control for the extreme deviation of a sample from samples of the same breeds, outliers were detected and removed based on the first three PCs (Supplementary file, Fig. S1). A total of 19 samples were excluded from the final data (Supplementary file, Fig. S2). More information about the outliers is available in Supplementary Table S4 and Fig. S3.
Genetic distances, pairwise F ST , and AFD Genetic distance analyses revealed that the Persian Arabian and Kurdish samples had the lowest value (Table 1). The Kurdish and Turkmen samples were intermediate in their genetic distance, followed by the Turkmen and Caspian samples. Persian Arabian horses showed the greatest divergence from the Caspian and Turkmen samples. Considering a 95% confidence interval, all breeds were distinguishable from one another with a pairwise F ST >0 (Table 1). The Turkmen and Persian Arabian samples showed the highest divergence. The lowest pairwise F ST observed between breeds originated from more geographically similar locations, including Turkmen vs. Caspian, and Kurdish vs. Persian Arabian. Considering AFD as a metric of genetic differentiation between the studied horse samples with asymmetrical sample size, the lowest magnitude of pairwise comparisons was in Kurdish vs. Persian Arabian (Table 1). The Kurdish showed similar AFD values to the Caspian and Turkmen horses, followed by the Turkmen and Caspian samples. Similar to the results for genetic distance, the highest AFD values were observed in the comparisons between Persian Arabian and the two Caspian and Turkmen breeds that were sampled in geographically distinct locations.
Estimation of effective population size A decrease in N e over time was apparent in all breeds, with a sharper decay for the more recent generations. The historical N e pattern of the Caspian breed differed from that of the other studied breeds, with the Caspian breed having the lowest N e in the recent generations (≤45 generations ago) and the second highest N e in the older generations (≥54 generations ago) (Fig. 3). The Kurdish population showed the greatest N e .
On the basis of LD estimations, the contemporary N e was estimated with various thresholds used to screen out rare alleles ( Table 2). The lowest estimated contemporary N e was for Turkmen with a range from 52.3 to 59.3. Values for the Caspian and Persian Arabian breeds, respectively, were similar: at 86.9 and 86 for MAF less than 0.1 and up to 98.4 and 101.7 using all SNPs. The Kurdish breed had the highest contemporary N e compared to the other breeds.
Population genetic structure and linkage disequilibrium The first PC separated breeds were based on the two geographic origin clusters, including north of Iran (for Caspian and Turkmen) and southwest and/or west of Iran (Persian Arabian and Kurdish) (Fig. 4A). PC1 also helped distinguish between the Persian Arabian and Kurdish breeds. The second PC separated the breeds within the two geographic origins from each other. Furthermore, the Caspian breed was separated from the Turkmen breed by the third PC.
The relationships indicated in the relatedness analysis using identity-by-descent measures (Fig. 4B) were supported by the separation patterns that were identified using PCA. The Caspian and Turkmen branches were most proximal to each other compared to both the Persian Arabian and Kurdish breeds. Our analysis divided the Kurdish breed into two separate branches, suggesting distinct subpopulations within this breed. The presence of both Persian Arabian and Kurdish horses in several instances suggests either recent crossbreeding or shared ancestry among these horses.  Admixture analyses were performed by assuming up to 7 clusters (K) to identify the most likely number of ancestral populations as determined by the cross-validation (CV) algorithm. The results are shown in Fig. 4C (for K = 2 to 3) and Supplementary file, Fig. S5 (for K = 4 to 7). The lowest CV error was detected for the model with two ancestral populations, which was very close to the error detected for K = 3. In K = 2, the Caspian and Turkmen samples were assigned primarily to one genetic cluster and the Persian Arabian sample to the other; the Kurdish horses had a relatively equal assignment to both clusters. With K = 3, the Caspian and Turkmen samples became more distinct, but with evidence of population structure in the Turkmen sample, as some horses shared similar ancestry proportions to Caspian horses. The Persian Arabian horses remained primarily assigned to the third cluster, apart from the Caspian and Turkmen horses, but with some Persian Arabian horses having admixtures from the genetics largely represented by the Caspian horses. The Kurdish horses remained admixed, but horses of that breed showed similar proportions of genetic ancestry from each of the three genetic clusters. The admixture results for higher numbers of ancestral populations are shown in the Supporting file, Fig. S5.
The LD patterns between the studied breeds indicated that the mean of r 2 in all breeds dropped rapidly up to approximately 100 kbps (Supplementary file, Fig. S4). The Persian Arabian population showed the slowest decay of LD compared to the other breeds. The average r 2 at 100 kbps for all, Turkmen, Caspian, Kurdish, and Persian Arabian breeds were 0.143, 0.161, 0.151, 0.152, and 0.183, respectively.

Runs of homozygosity (ROH)
A total of 60,408 ROH segments was identified in all the studied breeds (Supporting file, Table S5). As the sample size was different for the breeds, the absolute number of ROH in each breed is not comparable. The majority of ROH segments were classified in the shortest category with the length less than 8 Mb (> 97%). The total number of ROH segments in Kurdish, Persian Arabian, Turkmen, and Caspian horses were 24,194,18,959,9846, and 7409, respectively. The number of ROH segments exceeding 32 Mb length in Caspian, Kurdish, Persian Arabian, and Turkmen horses were 16, 13, 9, and 6 segments, respectively. Using a threshold of 0.7 for the shared ROH within each population, 24 peaks were found in Turkmen (2 peaks), Caspian (14 peaks), and Persian Arabian (8 peaks) populations (Supplementary file, Table S6).
Genomic inbreeding based on ROH (F ROH ) was estimated for each population separately (Supplementary file, Fig. S6). The mean estimated genomic inbreeding was similar across breeds, ranging from 21% in the Turkmen population to 26% in the Persian Arabian population. The Kurdish and Caspian populations showed a very similar inbreeding with means of 22% and 23%, respectively.
Caspian vs. Kurdish. A total of 27 significant SNPs were identified in this comparison (Supplementary file, Table S7). The most significant SNPs were located close to LLPH (chr6: 82.8 Mbps) with a DCMS value of 14.86 (Fig. 5). Gene annotation analysis of the significant SNPs in this pairwise comparison detected 45 (6 lncRNA, 2 miRNA, 1 pseudogene, and 36 protein-coding genes) candidate genes. Based on the GWAS catalog database, 19 of the candidate genes have been associated with human height. We also found 6 significant SNPs with at least one QTL located near to them in 5 different QTL types, including insect bite hypersensitivity (n = 2), wither height (n = 4), hair density (n = 1), alternate gaits (n = 1), and navicular bone morphology (n = 1). Of those, two suggested QTLs under selection for wither height located on chr11 (with QTL IDs 166105 and 165767) did not show selection pressure in the two other Caspian pairwise comparisons. No GO terms were significantly overrepresented in this comparison.
Caspian vs. Persian Arabian. We identified 24 significant SNPs in this comparison, with the most significant SNP on chr16 (71.8 Mbps) with a DCMS value of 14.59. Gene annotation analysis of the significant SNPs detected 28 (8 lncRNA, 1 miRNA, 1 snRNA, and 18 protein-coding genes) candidate genes. Of them, 9 candidate genes were found in the GWAS catalog list. The results showed 19 significantly overrepresented GO terms for the Caspian vs. Persian Arabian comparison. The most significant GO term was GO:0016790 (thiolester hydrolase activity). Among the significant SNPs, 4 loci were located close to 7 QTLs, which were classified into 5 QTL types, including immunoglobulin E level (n = 1), wither height (n = 2), alternate gaits (n = 2), equine sarcoids (n = 1), and immunoglobulin G level (n = 1).
Turkmen vs. Kurdish. Based on the DCMS method, 25 significant SNPs were detected on 10 different chromosomes (chr8, chr10, chr13, chr14, chr15, chr17, chr21, chr25, chr29, and chr31), with the most significant SNPs on chr10 (32.7 Mbps). Gene annotation analysis of these significant SNPs resulted in 28 candidate genes (4 Fig. 4 Population structure and relationship between Caspian, Turkmen, Persian Arabian, and Kurdish breeds tested in this study. A Principal component analysis based on the first three PCs. The four main Iranian breeds show genetic differentiation associated with their geographic origin clustered into two categories, including north of Iran (for Caspian and Turkmen) and southwest and/or west of Iran (Persian Arabian and Kurdish). B A relatedness tree was constructed using whole-genome SNP data. C Model-based clustering of four Iranian breeds using admixture analysis with the assumed number of ancestries of 2 and 3 (left) with a plot for CV errors of K ranging from 1 to 7 (right).
lncRNA, 1 snoRNA, 1 snRNA and 22 protein-coding genes). Of those, 11 candidate genes were located in the GWAS catalog database for human height. Gene enrichment analysis of the candidate genes detected 2 significantly overrepresented terms that were classified as belonging to the cellular component domain. The comparison detected only 1 significant SNP close to 2 QTLs that was classified as type 1, navicular bone morphology.
Turkmen vs. Persian Arabian. We identified a total of 13 significant SNPs on chr2, chr5, chr14, chr16, chr19, and chr27. The most significant SNP based on the DCMS method was located on chr27 (37.0 Mbps). Gene annotation analysis of these significant SNPs found 15 candidate genes for the pairwise comparison. Of them, 7 genes were found in the GWAS catalog list for human height. Gene enrichment analysis of the candidate genes detected 29 significantly overrepresented terms in the comparison, while the four top significant terms were related to the molecular function domain. In the comparison, 2 of 13 significant loci were located close to 3 QTLs, which were classified into 3 types, including immunoglobulin E level (n = 1), osteochondritis dissecans (n = 1), and osteochondrosis (n = 1).
Kurdish vs. Persian Arabian. Here, 29 SNPs with q-values less than 0.05 were detected, with the most significant signal on chr8 (52.2 Mbps). A total of 31 candidate genes was detected in the gene annotation analysis in the comparison. Of them, 9 genes were located in the GWAS catalog list. One of these, TGFB2 (chr30: 143 Mbps), has been noted as a candidate for human height in 10 different GWAS studies (Akiyama et al. 2019). TGFB is a growth factor and key regulator of several traits related to body composition, growth, and development in chicken (Dadousis et al. 2021;Li et al. 2003  Genes. We identified 8, 7, and 9 candidate genes overlapping in Caspian pairwise comparisons for Turkmen and Kurdish, Turkmen and Persian Arabian, and Kurdish and Persian Arabian analyses, respectively (Fig. 6). The results showed that 7 candidate genes were present in all the three Caspian pairwise comparisons. One of the significant SNPs, MNEc.2.6.81451782.BIEC2-1024200, was located within HMGA2. The allele G was the most common in the Caspian population with 0.95 frequency, while the frequencies for Turkmen, Kurdish, and Arabian populations were 0.33, 0.35, and 0.23, respectively (Supplementary file, Table S8). Our gene annotation analysis for the significant SNPs in all the three Caspian comparisons also detected another overlapped gene, LLPH, which was a candidate for selection in the Caspian vs. Kurdish comparison. We also identified a candidate gene located in the GWAS catalog list for human height, TRIM2 (chr2: 80.8 Mbps), which was common in the pairwise comparisons between Caspian and both Kurdish and Persian Arabian populations. There were 6 and 14 additional candidate genes for the pairwise comparisons between Caspian and both Persian Arabian and Kurdish populations, respectively, which were located in the GWAS catalog list for human height (Supplementary file, Table S7). We found 1 gene, IMPG1 (chr10: 31.2 Mbps), overrepresented in the comparisons between Turkmen and both Caspian and Kurdish populations, which is an ortholog for a protein-coding gene in humans and is a major component of the interphotoreceptor matrix (IPM). The IPM is suggested to have many important functions related to growth factors, such as regulation of retinoid transport, participation in cytoskeletal organization in the surrounding cells, and regulation of oxygen and nutrient transport (Ishikawa et al. 2015). Investigation of the genes for selective pressure in the comparisons between Turkmen and both Persian Arabian and Kurdish populations showed two common genes, ENSECAG00000042329 and U6 (chr29: 6.9 Mbps). U6 was associated with human height in 16 different GWASs (Kichaev et al. 2019;Sakaue et al. 2021).
The common SNPs resulted in identification of 5 nearby genes in the comparisons between Persian Arabian and both Turkmen and Caspian breeds. There was also 1 common candidate gene, CSMD1 (chr27: 35.8 Mbps), listed in the GWAS catalog database, in the comparisons between Persian Arabian and both Turkmen and Kurdish breeds. CSMD1 is a large transmembrane complement inhibitor with expression in many tissues such as testis, brain, lung, colon, thyroid gland, breast, and pancreas (Fagerberg et al. 2014). This gene is associated with several pathological processes, from neurodegenerative and psychiatric disorders to infertility and cancer (Gialeli et al. 2021). The comparisons between Persian Arabian and both Caspian and Turkmen breeds showed another 4 overrepresented candidate genes, CEP63 (chr16: 71.5 Mbps), kyphoscoliosis peptidase (Ky) (chr16: 71.8 Mbps), EPHB1 (chr16: 71.8 Mbps), and U6, which were listed in the GWAS catalog for human height.
QTLs. Some of the previously identified QTLs were found in the suggestive regions of selection in each of the comparisons (Supplementary file, Table S7). Among the different QTL types found in the Caspian pairwise comparisons, 2 QTLs (with QTL IDs 28316 and 28297) related to wither height and one QTL (with QTL ID 165888) related to alternate gaits were common in the comparisons between Caspian and all the other breeds (Supplementary file, Fig. S7). Concerning the QTL types in the three comparisons for Turkmen, the QTL for osteochondrosis was common in Turkmen vs. Caspian (with QTL ID 27159) and Turkmen vs. Persian Arabian (with QTL ID 28236) comparisons. Navicular bone morphology QTL type was common in all the three comparisons between Kurdish and three other breeds. Immunoglobulin E level QTL type (with QTL ID 28467) was represented in both Persian Arabian vs. Turkmen and Persian Arabian vs. Caspian comparisons. In addition, wither height QTLs were detected in comparisons between the Persian Arabian and both the Caspian and Kurdish samples.
GO. The results showed 17 and 19 significantly overrepresented GO terms for Caspian vs. Turkmen and Caspian vs. Persian Arabian comparisons, respectively, while the only GO:0060999 term (positive regulation of dendritic spine development) was present in these two comparisons (Supplementary file, Fig. S8). One GO term, GO:0015026 (coreceptor activity), was common in two Turkmen vs. Persian Arabian and Turkmen vs. Caspian comparisons. There were no significantly overrepresented GO terms in the three comparisons between Kurdish and the other breeds. A total of 10 GO terms were overrepresented in both Persian Arabian vs. Turkmen and Persian Arabian vs. Caspian comparisons. Two of these common GO terms, GO:0060996 (dendritic spine development) and GO:0060997 (dendritic spine morphogenesis), are related to the dendritic spine.

DISCUSSION
A better understanding of the effects of the evolution and adaptability of livestock species to different environments can improve our knowledge about the selection process (Saravanan et al. 2022). Iranian horse populations have been subject to different breeding objectives in distinct geographic regions with variation in the components of the environment such as temperature, humidity, and altitude ( Fig. 2 and Supplementary file, Table S3). Our findings showed a genetic division among the studied breeds that matched their different geographic origins. These results refine our understanding of the genetic relationship between these four Iranian horse breeds. We also identified genomic regions under selective pressures, and that these regions previously had been reported to be associated with morphology, adaptation, and fitness (Frischknecht et al. 2015;Smith et al. 2007), which are important phenotypic traits in horse breeding. Therefore, these results can support future genomewide association studies and subsequent fine mapping of important traits to improve our understanding of the genetic basis of these traits and allow for improved horse breeding.

Effective population size
Simplified, N e corresponds to the number of breeding animals in a population under idealized conditions (Wright 1931). Reduction in N e , as a result of decrease of genetic diversity within a population, may significantly reduce the population's ability to adapt to perturbations such as extreme weather conditions, disease, and infections (Palstra and Ruzzante 2008), and this may in turn increase the risk of extinction (Newman and Pilson 1997). The contemporary N e estimations performed within this study indicate that all four horse breeds should be considered for conservation, most urgently for the Turkmen horse breed, which showed the lowest N e with a value close to 50. The historical N e values computed were of the same magnitude as those reported in other studies using similar approaches (Corbin et al. 2010;Jasielczuk et al. 2020;Nazari et al. 2022;Salek Ardestani et al. 2022). Our study represents the first estimates of contemporary N e for the Turkmen, Caspian, Kurdish, and Persian Arabian populations, and alongside, the other studies that estimated their historical N e can help to quantify the magnitude of genetic drift and inbreeding in realworld Iranian horse populations (Nazari et al. 2022;Salek Ardestani et al. 2022).
Population genetic structure A clear genetic division among 36 globally distributed horse breeds was previously detected using a genomewide set of autosomal SNPs, and this genetic diversity largely reflected geographic origins and known breed histories (Petersen et al. 2013). Similarly, our results well matched the studied breeds with different geographic origins, as we classified the north breeds (Caspian and Turkmen) and southwest and west breeds (Persian Arabian and Kurdish) into two phylogeographic clades. These results are consistent with the results from Sadeghi et al. (2019), in which the PCA results demonstrated that the genetic variation was associated with the separation among indigenous horse breeds from different parts of Iran. We observed close genetic relationships between the Iranian horse breeds originating from the same geographic region. In concordance, relatedness and admixture analysis showed a close genetic relationship between Persian Arabian and Kurdish breeds, as well as between Turkmen and Caspian breeds.

Inbreeding
VanRaden et al. (2011) showed that pedigree measures of inbreeding values are not equivalent to the genomic inbreeding values. This difference may lead to an imperfect reflection of the level of homozygosity in the genome. ROH, the indicator of genomic autozygosity, are used more often to estimate the degree of realized genomic inbreeding (Chhotaray et al. 2021;Dadousis et al. 2022;Eydivandi et al. 2021b;Metzger et al. 2015;Saravanan et al. 2021). On assessing different genomewide SNPbased estimators of inbreeding using computer simulations, F ROH provided a more precise estimate of inbreeding than could be expected by using pedigrees (Caballero et al. 2022). On comparing different marker-based estimators, F ROH showed the highest correlation with pedigree-based coefficients in cattle using imputed SNP data (Dadousis et al. 2022). Long homozygous stretches and consequently high inbreeding coefficients characterize closed populations or those derived from a relatively narrow genetic base such as the Arabian and Thoroughbred populations (Metzger et al. 2015). In concordance, the Persian Arabian horses in our study showed a relatively high number of larger ROH (with sizes > 4 Mb, see Supplementary file, Table S5), probably due to the more recent inbreeding, as the ROH will break up over time by recombination, and consequently a relatively higher value for F ROH , indicative of inbreeding. Thus, when considering matings at an individual animal level, genomic information can provide a more accurate measure of inbreeding (homozygosity across the genome). This information can be applied in mating systems with the aim of minimizing inbreeding depression, especially in the Persian Arabian horses, which showed the highest estimated genomic inbreeding value compared with the other Iranian breeds.

Genomic signatures of local adaptation
The genomes of local breeds have been under natural and artificial selection for centuries to transmit the desired properties to the descendants. These genetic changes shaped recent gene pools, and can be used for investigating the genetic relatedness among breeds (Eydivandi et al. 2021a;Ju et al. 2019;Rajawat et al. 2022). Our study showed that the genomes of Iranian horse breeds contain multiple genomic regions under selective pressures (adaptation and/or breeding), representing an opportunity for a more detailed investigation of the genes involved in these procedures. For instance, ADGRV1 (chr14: 76.8) was under selective pressure in the Arabian and Kurdish comparison. This gene is suggested to have highly differentiated nonsynonymous alleles, which leads to an excess of the adrenocorticotropic hormone in Yakutian horses, which are known to be highly adapted to subarctic environments (Librado et al. 2015). Another candidate gene, TMED2, has been reported to regulate innate immune signaling and to specialize in responding to environmental exposures (Sun et al. 2018). Additionally, EIF2B1 is associated with stress signaling, as inhibition of this gene mediates the downregulation of protein synthesis under stress (Slynko et al. 2021). TRAPPC9 was common in comparisons between Kurdish and both Persian Arabian and Caspian breeds. This has been reported to be important in the trafficking and signaling pathways in health and disease in humans, where mutations in TRAPPC9 are linked to a form of mental retardation, breast and colon cancer, and liver diseases (Mbimba et al. 2018). Among the 4 common candidate genes from comparisons between Persian Arabian and both Caspian and Turkmen breeds, Ky is a gene involved in muscle growth, as the absence of Ky protein leads to muscular dystrophy in mouse (Blanco et al. 2001). The suggested selective sweep regions were close to or overlapped immunoglobulins, insect bite hypersensitivity, and guttural pouch tympany, while other sweeps overlapped wither height and alternate gaits. This suggests that sweep regions may help to understand the biology of these breeds and to improve our understanding of the changes in the biological domain as a result of variation in the identified genes, some of which were related to an important QTL. Moreover, this information can be used to develop breeds or crossbreeds that have a better performance and a higher environmental tolerance based on the detected candidate genes.

Genomic signatures of height
The Caspian horse is thought to be an ancient breed native to the north of Iran. This horse is considered as a small horse with height ranging from 9 to 10 hands (Shahsavarani and Rahimi-Mianji 2010). In accordance with these studies, our selection signatures revealed the HMGA2 in a genomic region putatively under selective pressure in all comparisons between Caspian as a small horse and the others with standard height. It has been shown that HMGA2 has a crucial role in the size variation of horses. For instance, investigation of the wither height in Shetland ponies by genomewide association (GWAS) using the GGP equine SNP70 BeadChip detected HMGA2 as a major QTL for this trait (Frischknecht et al. 2015). The study showed that the height of Shetland ponies and other small horses was reduced due to a nonsynonymous mutation in HMGA2. This gene has a well-known role in height determination in other species, including humans (Buysse et al. 2009) and dogs (Rimbault et al. 2013). The GWAS catalog data included 22 different studies that showed association between HMGA2 and the human height trait (Supplementary file, Table S7). This gene also showed a pleiotropic effect on both height and metabolic traits in ponies (Norton et al. 2019). Our results showed that the selection signal for this gene was detected due to a high frequency for allele G in the intragenic SNP MNEc.2.6.81451782.BIEC2-1024200 compared to the other studied breeds. Another gene in the same region overrepresented in all the three Caspian pairwise comparisons was LLPH, previously associated with height in human studies based on the GWAS catalog database. A GWAS in Duroc pigs showed a significant association between the average daily gain and a haplotype block containing LLPH (Quan et al. 2018). Although the four loci can explain up to 83% of the height variance in horses (Makvandi-Nejad et al. 2012), most of the underlying genetic variants that affect the height remain unknown. To pave the way for understanding the molecular mechanisms of height in Iranian horse populations, we suggest the regions that contain HMGA2 and LLPH as strong candidates for the height variation between Caspian and the other Iranian breeds. Beyond these genes, we suggest 38 new putative candidate genes under selective pressure that are related to variation in height traits in humans.

QTLs
Each Iranian horse breed has its own historical background and differs in usage. Finding selection signatures close to QTLs indicates a relationship between the selection for traits and the effects of variation at a locus. Accordingly, we investigated the candidate QTLs and biological pathways within the candidate regions under selection to identify pathways that might have shaped the Iranian horse genome. Some of the SNPs identified under selection were located close to previously reported QTLs related to morphological and behavioral traits in horses, such as wither height, hair density, navicular bone morphology, and alternate gaits. Some of them were located near to QTLs related to adaptation traits such as immunoglobulin level (Smith et al. 2007) (i.e. E and G) and insect bite hypersensitivity (Ablondi et al. 2020). Most of these SNPs were located near QTLs related to genetic disorders such as guttural pouch tympany, osteochondritis dissecans, osteochondrosis, and chronic progressive lymphedema. These results highlight that the suggested genes and QTLs within these categories might be targeted by selection in Iranian horses to adapt to environmental conditions or respond to human-mediated selection.

CONCLUSIONS
In general, exploring the genetic variation between Iranian horse populations evolved under diverse ecological conditions with prominent phenotypic variation suggested a clear genetic division among the breeds, which largely reflects their geographic origins. The use of information obtained from extensive studies in humans, retrieved from the GWAS catalog, can help to better interpret the results of selection signature studies. This information helped us to introduce 38 new candidate genes for height variation in Iranian horse populations in addition to the well-known regions that contain HMGA2 and LLPH as strong candidates. In addition, the identified candidate sweep regions provide a genomewide map of selection signatures in the studied breeds, including genes and QTLs for known morphological, adaptation, and fitness traits. This information may be valuable in formulating genetic conservation and improved breeding strategies for these populations.