Common protein-coding variants influence the racing phenotype in galloping racehorse breeds

Selection for system-wide morphological, physiological, and metabolic adaptations has led to extreme athletic phenotypes among geographically diverse horse breeds. Here, we identify genes contributing to exercise adaptation in racehorses by applying genomics approaches for racing performance, an end-point athletic phenotype. Using an integrative genomics strategy to first combine population genomics results with skeletal muscle exercise and training transcriptomic data, followed by whole-genome resequencing of Asian horses, we identify protein-coding variants in genes of interest in galloping racehorse breeds (Arabian, Mongolian and Thoroughbred). A core set of genes, G6PC2, HDAC9, KTN1, MYLK2, NTM, SLC16A1 and SYNDIG1, with central roles in muscle, metabolism, and neurobiology, are key drivers of the racing phenotype. Although racing potential is a multifactorial trait, the genomic architecture shaping the common athletic phenotype in horse populations bred for racing provides evidence for the influence of protein-coding variants in fundamental exercise-relevant genes. Variation in these genes may therefore be exploited for genetic improvement of horse populations towards specific types of racing.

H orse racing is among the oldest known sports, the general concept of which-horses, either mounted or harnessed, travelling at speed over a certain distance and terrain with the horse finishing first being the winner-has been largely unchanged for millennia 1,2 . The quest for wealth and status in racing traverses diverse cultures and geographies and today mounted horse racing is a globally popular sport. Different populations of horse have been developed for racing through a process of selective breeding for attributes required to excel in specific types of competition; these include breeds that compete in harness and trotting races, and others, including the Arabian, Mongolian and Thoroughbred, that are galloping breeds.
The Mongolian is one of the oldest extant horse populations and although domesticated, most animals are free ranging and experience minimal human intervention 3 . Mongolian horse populations have relatively high genomic diversity compared to other breeds 4,5 , which may reflect the role of the central Asian steppe region as an important centre for horse domestication 6 . The Mongolian is generally classified as a breed in toto but several phenotypically and genetically distinct subpopulations exist 4 , which are used for meat, milk, transport, and racing. Racing in Mongolia is celebrated annually during the Nadaam festival of racing where adult horses race over long distances (25-30 km) and harsh terrain. In recent years, Thoroughbred stallions have been imported for crossbreeding intended to improve speed traits in the racing populations.
The Arabian is also an ancient breed with high levels of genetic diversity 5,7 . Bred for millennia and developed by Bedouin nomads for transport and military use, the Arabian has traditionally excelled in long distance endurance racing (80-160 km), often in extreme climatic conditions. More recently there has been strong selection among subgroups for an aesthetic conformation phenotype, which is valued in show competition, and for short-distance track racing (~1600 m). Among track racing Arabians there is evidence of recent Thoroughbred crossbreeding, presumably for the introduction of speed, with some horses having up to 60% Thoroughbred ancestry 7 . Although considered a highly polygenic trait, sequence variants at several genes have been reported to be directly associated with performance traits in the Arabian breed [7][8][9][10] .
Compared to the Arabian and Mongolian breeds, the Thoroughbred was developed relatively recently during the last three centuries by crossing native British and Irish mares with stallions imported from the Middle East 11 . Most Thoroughbreds compete in races over much shorter distances (1000-3200 m) on maintained track surfaces and are bred for both speed and stamina attributes 12 . Originating from a very small number of founders 13 , and with subsequent restricted gene flow since the formation of the stud book 14 , the Thoroughbred now has very low levels of genetic diversity despite a large global census population size 5,15,16 . These population demographics, coupled with constant human-mediated selection pressures, have resulted in athletic traits with genetic architectures that are especially amenable to modern genomics, particularly because of the high levels of linkage disequilibrium observed across the Thoroughbred genome 17 with haplotypes extending >4 Mb in regions under selection 5,12 . As a result, genome-wide association studies (GWAS) have been successfully deployed to identify quantitative trait loci (QTLs) for complex traits using relatively modest sample sizes [18][19][20][21] . Investigations of genomic targets of selection in the Thoroughbred 5,15,16,22,23 , and functional analyses of gene expression profiles in skeletal muscle [24][25][26] , have identified suites of genes and molecular pathways that are enriched for functions in energy metabolism, muscle contraction, haemostasis, organismal growth and development, lipid metabolism, the mitochondrion, fatty acid metabolism, cardiovascular signalling, cellular stress and injury, and neurotransmitters and other nervous system signalling.
The recently developed omnigenic model proposes that phenotypic outcomes for eukaryotic complex traits are directly shaped by core genes that are embedded in highly interconnected tissue-specific gene regulatory networks, which are substantially modulated by very large numbers of genetic variants of small effect at peripheral genes across the entire genome 27 . The most well studied and arguably core exercise-relevant gene in athletic horses is the myostatin gene (MSTN), where a SINE insertion promoter polymorphism 28 profoundly affects skeletal muscle development 29,30 and distribution of muscle fibre types 5,31,32 . The combined effect of the major MSTN QTL and additive genetic variation across the genome is illustrated by recent work examining the genomic architecture and heritability of optimum race distance in multiple Thoroughbred populations 33 .
In addition to myriad genetic contributions, understanding the biological basis of complex traits is further challenged by the variation in multi-dimensional system-wide endophenotypes that can be dynamic and influenced by the environment, a concept that emerged initially in neurobiological genomics 34 . Like all athletes, in addition to multiple anatomical, physiological, and metabolic processes, environmental factors and interactions also determine the elite athletic phenotype of racehorses 35,36 . Consequently, numerous endophenotypes likely contribute to biological systems relevant to the equine athlete. However, notwithstanding this complex vista, relatively straightforward selection signal detection approaches-without recourse to accurately measured endophenotypes such as hormone levels or other biomarkerscan be used to identify genes or genomic regulatory elements containing sequence variants contributing to recent evolutionary adaptation and important physiological traits in livestock species 37-40 . In the present study, we hypothesised that galloping racing breeds harbour signals of selection that contain genomic loci with sequence variants contributing to racing ability. To refine the selection signals, we assigned functional relevance to SNPs that were in proximity to differentially expressed genes (DEGs) in Thoroughbred skeletal muscle 24 and then interrogated prioritised genomic regions for putative functional protein-coding variants identified from whole genome sequence (WGS) data generated from Asian landrace horse breeds. Based on predicted variant effects and the biological functions of the genes containing them, we hypothesised that these variants may contribute to variation in racing ability among horse breeds. To further explore and validate this hypothesis, we generated genotypes for a panel of these variants in independent sample sets of Thoroughbred and Mongolian Racing horses and in other racing and athletic horse breeds, and examined whether they were associated with racing traits. The overall goal of this research was to identify a set of genetic markers contributing to athletic performance in populations of horses bred for competitive racing. These markers may be used for the improvement of racing populations, including among Mongolian horses, for which there has been less opportunity for systematic pedigree and phenotype selection that is advanced in Thoroughbred and the other more common racing breeds.

Results and discussion
Population structure. An overview of the sequential steps of the study is provided in Supplementary Fig. 1. We first evaluated genetic relatedness and population structure based on genomewide SNP-array derived genotypes among the Racing breeds (Arabian, Mongolian Racing and Thoroughbred) in the context of non-Racing breeds using a principal component analysis (PCA) ( Fig. 1) and an admixture structure plot (Fig. 2). In the PCA plot of PC1 and PC2, Thoroughbreds were tightly clustered with no overlap with any other breed, while Arabian and Mongolian Racing horses were more loosely distributed across PC1 (32.5% of the variance) and there was some overlap between the two populations although a shared recent ancestry was not expected.
Since Thoroughbred admixture has been observed in track racing Arabians 7 and there has been an introduction of Thoroughbred stallions to Mongolia in recent years, we assessed the contribution of Thoroughbred ancestry in the Arabian and Mongolian Racing populations (Fig. 2). In the admixture structure analysis, the Belgian breed was used as a comparator population as it is distantly related to all three Racing breeds 5 . The lowest cross validation error for K = 2-6 modelled ancestral populations was observed for K = 4 (Fig. 2); therefore, this value was considered the most suitable for evaluating ancestry and quantifying admixture 41,42 . The Belgian and Thoroughbred displayed minimal evidence of admixture arising from the other breeds. Five Arabian samples exhibited >50% Thoroughbred genetic ancestry and eight had no Thoroughbred contribution. Except for five animals, the Mongolian Racing samples had some shared ancestry with the other breeds and Thoroughbred ancestry was >50% in one sample. There was minimal sharing of genetic background between the Mongolian Racing and Arabian populations. Based on the structure plot, it is clear that there is Thoroughbred ancestry in many of the Mongolian Racing and Arabian animals, and this is reflected in their position along PC1. Supplementary Data 1 details the individual ancestry contributions at K = 4 modelled ancestral populations for animals in the study.
Separate PCA plots were also generated for the Arabian and Mongolian Racing populations genotyped in this study compared to other Arabian horses 7 ( Supplementary Fig. 2) and Mongolian horses indigenous to Inner Mongolia, China 4 ( Supplementary  Fig. 3). The Arabian horses were genetically diverse and were distributed predominantly across PC2 (16.2% of the variance), which encompassed most of the Arabian variation to the exclusion of the Straight Egyptian. The Mongolian Racing horses did not overlap with the Chinese Mongolian horses and were distributed mainly across PC2 (13.2% of the variance).
In summary, although there was some shared ancestry among the Racing populations, this was not widespread among individual animals suggesting that the observed Thoroughbred admixture most likely reflects recent breeding practices. Therefore, it is unlikely to influence detection of long-standing selection signals due to persistent selection over relatively extended time frames. Furthermore, the composite selection signals (CSS) approach used in this study combines component signals to detect only strongly selected regions that have a common signal across the constituent tests 39 . By contrast there is considerable Thoroughbred gene flow in other racehorse breeds such as Quarter Horse, which has a distinct subpopulation bred for racing 5,16 . Consequently, selection signals identified here among the Racing populations (Arabian, Mongolian Racing, and Thoroughbred) were hypothesised to reveal genomic regions contributing to similar genomic architectures that result from convergent evolution towards the racing phenotype and not gene flow.
Genomic signals of selection among Racing breeds. To identify genomic regions targeted by selection for the racing phenotype, we compared allele frequency distribution variation among two data sets comprising horses from Racing (n = 90) and non-Racing (n = 483) breeds (Supplementary Data 2) using a CSS test that combines the XP-EHH, F ST and ΔSAF statistics 39 . Genomewide distribution of the smoothed CSS score test statistics (-log 10 P) for comparison of the Racing versus non-Racing populations identified 14 genomic regions with signals of selection, defined as clusters of SNPs (>5) among the top 1% SNPs, on ECA1, ECA2, ECA4, ECA5, ECA7, ECA9, ECA14, ECA17, ECA18, and ECA22  There was considerable overlap with selection signals previously reported in a range of athletic horse breeds and the selection signals also overlapped with QTLs identified in GWAS for racing performance traits in Thoroughbreds 43,44 (Table 1). Notably, the second ranked region (ECA7: 40.44-42.86 Mb) containing the NTM gene, coincided with the top GWAS peak identified from a comparison of Thoroughbreds that had raced and Thoroughbreds that had never had a racecourse start 44 , and a selection signal on ECA2 (ECA2: 100.3-101.78 Mb) overlapped with a GWAS peak for measured speed traits in juvenile Thoroughbreds 43 .
Genomic signals of selection in Arabian, Thoroughbred, and Mongolian Racing breeds. We evaluated the overlap between the Racing selection signals and selection signals identified when each of the Racing breeds was analysed separately (  There was also considerable overlap among the Racing selection signals with selection in the Arabian (only), with seven of the 14 clusters also detected in the Arabian versus other breeds analysis ( Supplementary Fig. 5 51 . In human endurance athletes SELENBP1 is differentially expressed in blood in response to administration of human recombinant erythropoietin suggesting a potential role in the regulation of haematopoiesis 52,53 . The top ranked SNP for the overall CSS score and XP-EHH statistic was located within the TBX15 gene that functions in skeletal development of the limb, vertebral column 54 , and shoulder and pelvic girdles 55 . Conformation is a key phenotype on which racehorses are selected and the axis of the pelvis has been shown to be associated with injury risk and performance in Thoroughbreds 56 . TBX15 also plays a major role in skeletal muscle fibre type differentiation and regulates the metabolism of glycolytic myofibres 57 and white adipocytes 58 especially in the browning of adipocytes and has been considered a target for the treatment of obesity 59,60 . In this regard, we previously proposed that adipocyte browning may be a key contributor to the equine athletic phenotype 22 . Furthermore, TBX15 is among the top ranked differentially expressed downregulated genes in Thoroughbred skeletal muscle in response to training 24 , implicating it as central to adaptation to the exercise stimulus. The second ranked CSS region contained the top ranked SNP according to the ΔSAF statistic that was 16 kb from the closest gene, PPARA, which has a major role in exercise and training and is associated with elite human endurance athlete status 61,62 . The highest-ranking SNPs on ECA26 encompassed three genes (NRIP1, BTG3, and CHODL) all of which may be candidate genes for exercise adaptation [63][64][65][66][67][68] . The highestranking SNPs according to the F ST statistic were on ECA7 within the NDUFB7 and CACNA1A genes. NDUFB7 encodes a structural subunit of complex I of the mitochondrial respiratory chain and mutations in the gene have been observed to cause hypertrophic cardiomyopathy and lactic acidosis 69 . In a     was a positive relationship between CSS score and the size of the region identified (r 2 = 0.6). Therefore, the most strongly selected regions (and regions of greatest interest) were the largest and in most cases these regions contained sizeable numbers of genes. Proximity of a top ranked SNP (using CSS or any of the individual test statistics) to a gene may be informative to some degree; however, linkage disequilibrium extends across large regions in horse populations 17 , the equine SNP genotyping arrays exhibit ascertainment bias 73,74 being designed to assay genetic variation among the main European and North American breeds and, as illustrated above, many genes in each detected region exhibit biological functions easily interpretable as affecting exercise physiology. Therefore, to better characterise genes subject to selection for exercise traits, we used additional methods that leveraged transcriptomics data to prioritise SNPs proximal to DEGs in equine skeletal muscle, and WGS data from a cohort of (mostly) Asian horses.
Integrative genomics. Differential patterns of gene expression are the key determinants of phenotype, and integration of transcriptomics and genetic data has been successfully applied to understand the molecular basis of exercise adaptation 75,76 . Here, to refine the SNPs from the population genomics analyses, we integrated these data with DEG sets derived from Thoroughbred skeletal muscle RNA-seq data that distinguish exercise (untrained exercise, UE) and training (trained rest, TR) response transcriptomes 24 . For computational efficiency, the gene lists were refined to include DEGs with P adj. < 10 −12 (UE) and P adj . The R software package gwinteR 77 was used to determine whether genomic regions containing SNPs that are proximal to genes within the DEG sets were enriched for significance in the CSS analysis for Racing versus non-Racing breeds. The numbers of statistically significant SNPs pre-and post-data integration are summarised in Supplementary Data 7. In terms of SNP enrichment (P perm . < 0.1), the integrative analysis was effective for the two input DEG sets. Using a search window that iteratively increased in size from 10-100 kb up and downstream of the genes of interest, a search space of ±100 kb produced the highest number of significantly enriched SNPs with the lowest probability of being significant by chance when compared to a null distribution of 1,000 sets of SNPs randomly sampled from the CSS dataset. SNPs within ±100 kb of a DEG were therefore selected as the target SNP sets to generate new q-values. Gene loci associated with enriched CSS SNPs are provided in Supplementary Data 8. Two genes (LPIN1, LRRC3B) were enriched for SNPs (q < 0.05) in the exercise response (UE) gene set and three genes (CBR4, SYNDIG1, MYOM2) were enriched for SNPs (q < 0.05) in the training response (TR) gene set. Five genes (HMOX1, KTN1, MYLK2, NEO1, TUBA4A) were common to both outputs (q < 0.1) and two of these (NEO1, MYLK2) were located within the selected regions defined by the top 1% CSS SNPs.
Since there was less overlap between the Mongolian Racing selection signals and the Racing selection signals than there was for the Thoroughbred and the Arabian, we separately integrated the Mongolian Racing CSS SNPs in the context of the skeletal muscle DEGs to refine the gene sets (Supplementary Data 9). Again, using 100 kb windows, three genes (PPP2R3A, PELO, GLB1) were enriched for SNPs (q < 0.05) in the exercise response (UE) gene set and three genes (TBX15, KHDRBS3, VEGFA) were enriched for SNPs (q < 0.05) in the training response (TR) gene set (Supplementary Data 10). In addition, three genes (MAP7D1, STAC3, VEGFA) were common among the two outputs (q < 0.1); however, none of these was located within the selected regions defined by the top 1% CSS SNPs. Four DEGs with localised SNP enrichment were located in the top-ranked CSS region for the UE and TR gene sets (APH1A, ATP1A1, UE; CA14, TBX15, TR) and four were among the other CSS regions (ANKRD23, IFI30, RAB30, UE; NXN, TR). The five most significant SNPs for the TR gene set were upstream and within the TBX15 gene. The most significant SNP for the UE gene set was in PPP2R3A.
Whole genome resequencing and variant calling. To identify gene variants with putative functional effects that may be the targets of selection we generated WGS data for 70 horse samples ( Supplementary Fig. 6 Identification of sequence polymorphisms in exercise relevant genes. To generate a panel of sequence polymorphisms to test for alleles with significant deviations in frequency between different subgroups of horses, we focused on identifying protein-coding variants in candidate genes within the selection signal regions and genes identified from the integrative analysis. We focused on polymorphisms (SNPs and small indels) with moderate minor allele frequencies (MAF >0.1) as we did not expect this approach to identify rare small-effect variants. In addition, we did not expect to identify severely deleterious mutations, and therefore the search was not limited to variants with a predicted high effect on modifying gene function.
For the Racing breeds, the eight highest ranked selected regions and 11 significant regions from the integrative analysis (five common to UE and TR, including two that also overlapped with CSS; three unique to UE; three unique to TR) were used to search for putative functional variants with the WGS data (Table 3). Among the searched regions, for validation we chose high-effect variants in four candidate genes and moderate effect variants in 14 candidate genes (Supplementary Data 15). Three regions did not contain any variants that met the prioritisation criteria. Notably absent were variants in the top ranked CSS region on ECA1 that contained PCDH15 and ZWINT. PCDH15 has been associated with lipid phenotypes 78 , but is best known for association with deafness 79 and is not a compelling candidate gene. On the other hand, the ZW10 interactor protein, encoded by ZWINT, functions in neurotransmitter release and in rodents mediates negative behaviour induced by neuropathic pain 80 , which may be relevant to exercise 81 . We previously reported a sequence tag <5 kb from ZWINT among the most differentially expressed downregulated transcripts in the training-response skeletal muscle transcriptome in the horse 25 implicating the locus as functionally relevant to exercise. The absence of identified gene-specific variants in this region may be explained by the focus here on the identification of common protein-coding variants, which precludes the identification of sequence variants in genomic regulatory elements, copy number variants, and chromatin state modifications that also contribute to the gene regulatory networks underlying complex traits 82,83 .
A similar approach was taken to identify putative functional variants within regions identified for the Mongolian Racing analyses. For Mongolian Racing the seven highest ranked selected regions and seven significant regions from the integrative analysis (one common to UE and TR that also overlapped with CSS; four unique to UE including one that overlapped with CSS; two unique to TR) were prioritised (Table 3). For validation, we chose high effect variants in four candidate genes and moderate effect variants in eight candidate genes (Supplementary Data 15).
In total, 32 polymorphisms in 27 genes were selected for validation genotyping on the basis that the variants disrupt the sequence of proteins with central roles in exercise physiologyincluding key functions associated with muscle, heart, angiogenesis/blood, limb development, metabolism, and neurological tissues. The known biological functions of the genes are summarised in Supplementary Note 1. Of the 32 polymorphisms, 23 SNPs met the assay design criteria and passed post-genotyping quality control and were used in tests of genetic association. Genotypes were generated for independent validation sample sets that were not used for the selection signals analyses.
Genetic association with the racing phenotype. We hypothesised that genetic variants targeted by selection for the racing phenotype segregate among horse breeds to influence underlying endophenotypic variation. Genotypes for the panel of 23 SNPs were generated for n = 267 horses from six breeds (Arabian, French Trotter, Mongolian Racing, Quarter Horse, Standardbred, and Thoroughbred) chosen to represent racing breeds, and n = 249 horses from eleven breeds (putatively ancestral to Thoroughbred-Akhal Teke, Egyptian Arabian, Moroccan Barb; Chinese Mongolian landrace-Baerhu, Baicha Iron Hoof, Keerqin, Wushen, Wuzhumuqin; sport horses-Connemara, Irish Draught, Dutch Warmblood) representing non-racing breeds (Supplementary Data 16). Additional detail for the breeds is provided in Supplementary Note 2.
In tests of genetic association, SNPs in nine genes were significantly (Bonferroni-adjusted P < 3.57 × 10 −3 ) associated with the racing phenotype (Table 4). Eight were missense variants predicted to have a moderate effect on the protein and one (SLC16A1) that introduces a stop codon was predicted to have a high effect on the protein. We did not expect to identify loss of function mutations, since we expected here to detect alleles that are advantageous for exercise. The introduction of a stop codon may not always disrupt the function of a protein if there is limited truncation or if there is stop codon read through 84 .
Biological functions relevant to exercise among genes significantly associated with racing. The functional relevance of this gene set is supported by the integrative analyses in which three of the genes (KTN1, MYLK2, and SYNDIG1) were enriched for SNPs among DEGs in the skeletal muscle exercise and training response. A literature search and review of associated gene ontology functions, indicated that this set of genes have roles in muscle (HDAC9, MYLK2), metabolism (FASKD1, G6PC2, GLB1, SLC16A1) and neurobiological (KTN1, NTM, SYNDIG1) functions that are linked to exercise-relevant phenotypes. Skeletal muscle is a highly plastic tissue that responds to exercise and training stimuli by increasing muscle mass and changing fibre type composition with concomitant mitochondrial functional adaptations 85 . Here, we identified two genes relevant to muscle function that were significantly associated with the racing phenotype, and we consider to be core genes. The HDAC9 gene encodes a protein that inhibits skeletal myogenesis and is involved in heart development [86][87][88][89] . In Thoroughbred skeletal muscle HDAC9 is among the top five most significant DEGs downregulated in the exercise response (log 2 FC = −2.67, P = 1.21 × 10 −20 ) 24 . In humans, HDAC9 gene variants are associated with the maximal oxygen uptake (VO 2max ) response to training 90 . Among the racing breeds, the Thoroughbred had the highest frequency (0.65) of the A-allele, which was more than twice the frequency of the allele among the other racing breeds (mean = 0.31) and 3.4× that among the sport horse breeds (0.19). Allele frequencies for all SNPs in each breed are shown in Supplementary Data 17.
MYLK2 encodes a myosin light chain kinase (MYL2) expressed in skeletal muscle. The enzyme has a critical role in muscle contraction, and functions in neuromuscular synaptic transmission, skeletal muscle satellite cell differentiation, regulation of muscle filament sliding and skeletal muscle cell differentiation 91,92 . MYLK2 was the most significantly downregulated gene (log 2 FC = −1.31, P = 1.37 × 10 −22 ) among the 3,241 DEGs in Thoroughbred skeletal muscle following exercise and ranked 6 th following a period of training (log 2 FC = −1.04, P = 1.13×10 −6 ) 24 , higher than MSTN (14 th , log 2 FC = −2.56, P = 1.43 × 10 −6 ), a gene with a wellestablished functional role in exercise 19,29,31,33 . In humans, genetic variants in MYLK are associated with phenotypic responses to exercise-induced muscle damage 93 . Here, the A-allele that was significantly different between racing (0.38) and non-racing breeds (0.25), had, among the racing breeds, the highest frequency in Arabian (0.63) and French Trotter (0.45) and the lowest frequency in Quarter Horse (0.28) and Standardbred (0.28) (Supplementary Data 17). Based on the considerable functional evidence, we propose that genetic variation in HDAC9 and MYLK2 has a critical role in determining the muscle phenotype of racehorses.
The metabolic properties of skeletal muscle are largely influenced by the proportion of slow (type I) and fast (type II) muscle fibres, which are defined by the myosin heavy chain isoforms and characterised by the different densities and functional properties of mitochondria 94 . Within the different fibre types, the glycolytic and oxidative pathways are tightly regulated to ensure an adequate supply of ATP to meet energy demands. The G6PC2 gene encodes a major component of glycolysis [95][96][97] . Protein-coding variants in the gene are associated with fasting glucose levels in humans and there is strong evidence that G6PC2 is an effector gene for glucose regulation 97 . Among breeds, the G-allele occurred at the highest frequency in Thoroughbred (0.95) and was lowest in Connemara (0.50). The glycolytic requirements for high intensity exercise are likely responsible for the observed variation in this gene among the breeds. The FASTKD1 and PPIG genes are also located in the region exhibiting the selection signal for G6PC2. The FAST kinase domains 1 protein, encoded by FASTKD1, supports mitochondrial homeostasis, and has a critical protective role against oxidant-induced cell death [98][99][100][101] . However, the strongest association in racing breeds was with the G6PC2 SNP and its wellestablished biological function in the regulation of glucose suggests that it could underpin the selection signal at this locus.
One of the best characterised genes for racing performance in Arabian horses is the SLC16A1 gene encoding the solute carrier family 16 member 1 protein that catalyses the movement of lactate and pyruvate across the plasma membrane 8,9,102 . In humans, genetic variants in the gene are used to predict athletic performance, in particular high-intensity exercise, and power ability 103,104 . Here, we have identified a novel variant that is predicted to have a major effect on the resulting protein through the introduction of a stop codon. The value of this variant in prediction of racing performance among Arabian horses requires testing in horses phenotyped for economically relevant racing traits.
Neurobiological functions have regularly featured in equine exercise transcriptomics and genomics research 24,43,105 . Here we identified SNPs in three genes with functions in neurobiology, KTN1, NTM and SYNDIG1. Of particular note is NTM encoding neurotrimin, which functions in brain development, regulates neural growth and synapse formation, and influences learning and memory [106][107][108][109][110][111] . A GWAS in Thoroughbreds previously identified this locus as the most significantly associated with the number of racecourse starts 44 . NTM also ranks among the top 10 genes positively selected during horse domestication 112 suggesting that equine neurological systems associated with domestication may overlap with adaptive traits that are required for racing. Here, the NTM SNP was the most significantly associated (P = 7.49 × 10 −14 ) with the racing breeds and among all breeds the highest frequency of the racing allele was in the Thoroughbred (0.89). In humans, KTN1 gene variants are strongly associated with KTN1 gene expression in the putamen and the volume of the putamen 113 , a region of the forebrain belonging to the basal ganglion that influences motor behaviours including motor planning and execution, motor preparation, amplitudes of movement and sequences of movement [113][114][115][116][117][118][119][120] . Here, the KTN1 G-allele was <1% in Thoroughbreds but had an average frequency of 0.22 in the other racing breeds, was 0.34 in the ancestral breeds and 0.21 in the sport horse breeds. Selection for the racing allele in breeds other than the Thoroughbred may be valuable in improving locomotor functions critical to racing. For SYNDIG1, the product of which regulates the development of excitatory synapses [121][122][123] , we observed that selection may already have fixed the exercise-favoured variant in racing breeds; the T-allele was absent in Thoroughbred, Arabian and Akhal Teke and was observed at a low frequency in the other breeds, with the highest occurrence in Connemara (0.29) and Irish Draught (0.25).
Genes associated with racing in Mongolian horses. Considering the difference in the selection signals profile of the Mongolian Racing horses (compared to the results from the Racing breeds), we also performed tests of genetic association for eight SNPs in a cohort of Mongolian horses selected by herdsmen for racing by comparing the genotypes to a set of Chinese Mongolian horses that are not used for racing (Supplementary Data 16). The GLB1 SNP was significantly (Bonferroni-adjusted P < 0.006) associated with the racing phenotype among Mongolian horses ( Table 5). The protein encoded by GLB1, beta-galactosidase, has a role in several metabolic pathways and is the most widely used biomarker for senescent and aging cells 124 . There are a number of GLB1 related disorders 125 including a disruption of normal skeletal morphologies 126

and cardiomyopathies.
Genes associated with racing performance in Thoroughbred horses. To test for genetic association with racing traits among Thoroughbreds, we partitioned a large archive of samples (n = 1134) into three groups: horses classified as elite, horses that had raced but had never won a race, and horses that were unraced (Supplementary Data 18). Among a cohort of horses that had raced in North America, the MYLK2 SNP was significantly (P < 0.005) associated with elite racing performance, but it was not associated with the trait among Australian (P = 0.43) or European (P = 0.47) horses (Supplementary Data 19). We have previously observed regional-specific variation for racing performance among Thoroughbreds 23 , which may be due to different selection pressures for the various dynamics in each racing ecosystem. Among European Thoroughbreds, the NTM SNP was suggestive of association with the occurrence of a racecourse start (P = 0.01), and although it did not meet the threshold for significance following correction for multiple testing, the occurrence of this locus in a previous GWAS 44 , and the observation of the highest frequency of the SNP among Thoroughbreds, strongly implicates NTM as an economically important gene in the Thoroughbred.

Conclusions
Adaptation driven by strong directional selection at key loci is likely to be particularly important in livestock species that are subject to management-based selection decisions 37,127 . Here, we provide evidence for major-effect variants in shaping the racing phenotype in horse populations. We have demonstrated that allelic variants in a core set of fundamental exercise-relevant genes segregating among horse populations likely underpin an array of adaptations required for racing. Genes including G6PC2, HDAC9, KTN1, MYLK2, NTM, SLC16A1 and SYNDIG1, with roles in muscle, metabolism, and neurobiological functions, appear to be central to shaping the racing phenotype in horses. These results are likely to inform genome-enabled improvement of horse racing populations and highlight genes of interest for athletic traits in other species for which exercise adaptation is desired.

Methods
Ethics statement. Samples genotyped in this study were collected with informed owner's consent for commercial genetic testing and approved for use in research. As such, institutional animal research ethics was not required. Approval for collection and the movement of genetic material for the whole genome sequencing analyses was granted by Inner Mongolia Agricultural University Animal Research Ethics Committee and Mongolian University of Science and Technology.
Selection signals analysis samples. Mongolian Racing: Genotypes were generated for n = 24 Mongolian Racing horse samples using the GGP Equine SNP70 genotyping array and were used as the primary Mongolian Racing cohort (Supplementary Data 2). Samples were collected in Khentii province, Mongolia, in 2018. No pedigree/identification or racing performance data was available for the horses. Horses were selected for sampling based on the herdsmen's knowledge of relatedness, from herds of horses bred and owned by the Ajnai Sharga Horse Racing team. Mongolian populations that had been previously genotyped 4 were used for the breed-specific PCA and included horses from the Abaga Black, Baicha Iron Hoof, Sanhe, Wushen and Wuzhumuqin populations.
Arabian: Genotypes were generated for n = 30 Arabian horse samples using the GGP Equine SNP70 genotyping array and were used for the selection signals analysis only. Samples were collected in United Arab Emirates in 2014. All horses were bred for racing competitions. No additional pedigree/identification or racing performance data was available for the individual horses. To confirm the racing phenotype for this sample, a PCA was performed using this sample and an additional n = 152 Arabian horses that had been previously genotyped 7 and had recorded performance uses (racing, endurance, show). The sample of horses genotyped in this study was confirmed to be distributed in the PCA space shared predominantly with racing Arabians (Supplementary Fig. 7). For the breed-specific PCA ( Supplementary Fig. 2) Arabian horses from different geographic regions were used. Although there are numerous different subtypes of Arabian horses based on geographic origin and performance use, for simplicity, the racing Arabian horses used in this study are referred to as Arabian. Additional detail for the breed is provided in Supplementary Note 2. Thoroughbred: The Thoroughbred cohort (n = 36) had been previously genotyped 5

and included horses originating in USA and Europe (Supplementary Data 2).
Racing breeds: For the CSS analysis the Mongolian Racing, Arabian and Thoroughbred populations described above were used as the Racing cohort.
Non-Racing breeds: Horses representing 21 diverse horse breeds that have not been bred for racing were used as the comparator cohort to represent non-Racing breeds for the CSS analyses. Samples had been previously genotyped 5  Whole-genome resequencing samples. Horses from breeds indigenous to Mongolia and China, and horses from breeds imported to China were used for generation of whole-genome sequence (WGS) data. Samples from indigenous horses (n = 50) were obtained from rural localities across China and Mongolia ( Supplementary Fig. 6). Additional samples (n = 20) were collected from Akhal-Teke (n = 5), Arabian (n = 5), Shetland Pony (n = 5), Friesian (n = 2), Clydesdale (n = 2) and Russian Draft (n = 1) (Supplementary Data 11). No pedigree information was available for these horses, but sampling of related animals was avoided based on the information provided by horse owners and local herdsmen. Genotype QC and datasets. In total, SNP array-derived genotypes for n = 574 horses from 24 distinct populations (Supplementary Data 2) were used, including n = 30 Arabian and n = 24 Mongolian Racing horses that were genotyped using the GGP Equine SNP70 genotyping array in this study. The n = 30 Arabian horses were selected from a set of n = 70 horses based on limited relationship (π < 0.25). Genotype data (50,042 autosomal SNPs generated using the Illumina EquineSNP50 genotyping array) for n = 520 horses were accessed from www.animalgenome.org/ repository/pub/UMN2013.0125 5 including Thoroughbred and horses representing populations that were not specifically bred for racing (non-Racing). These data were merged with genotypes generated in this study and SNP QC and filtering were performed across populations on the merged data set. Individual SNPs with >10% missing data and MAF < 0.01 were removed, resulting in 36,767 SNPs for analyses.
Genotypes for n = 30 Arabian horses genotyped in this study were combined with publicly available genotypes for n = 378 Arabian horses obtained from the Mendeley Data resource (https://doi.org/10.17632/mkk5khxrbp.3) 7 . After removal of individual SNPs with >10% missing data and MAF < 0.01, 35,292 SNPs remained for PCA analysis.
Genotypes for n = 24 Mongolian Racing horses genotyped in this study were merged with publicly available genotypes for n = 100 Chinese Mongolian horses obtained from the Open Science Framework (https://osf.io/2xvqf/quickfiles) 4 . After removal of individual SNPs with >10% missing data and MAF < 0.01, 60,994 SNPs remaining for PCA analysis.
Principal component analysis. To visualise genetic relatedness among the populations, principal component analysis (PCA) was performed using smartPCA from the EIGENSOFT package (version 4.2) 128 . One outlier (Shire, ID:SH144) was identified and excluded from further analyses. The outlier was separated from the breed cluster and had also been identified as an outlier in the study from which the data were obtained 5 .
Admixture analysis among Arabian and Mongolian Racing populations. For the analysis of population substructure, model-based clustering was performed using the software package ADMIXTURE 129 . The model assigns ancestry based on a predefined number of K ancestral populations. Individuals are assigned to K clusters based on allele frequencies and the proportion of ancestry from each population is estimated. The analysis was performed for K ranging from 2-6. Unsupervised modelling was used to predict allele frequencies in four ancestral genetic lineages (K = 4) and each animal's genome was partitioned and proportionally assigned to one of the four lineages. The outputs from the analysis were visualized using pophelper 2.3.1 130 .
Identification of selection signals. Composite selection signals (CSS) analyses 39,131,132  The CSS approach is among several composite approaches to successfully identify genes under selection for monogenic and polygenic traits in livestock 39,133 . CSS uses fractional ranks of constituent tests allowing a combination of the evidence of historical selection from different population genetic tests of selection. We used the fixation index (F ST ), the change in selected allele frequency (ΔSAF) and the cross-population extended haplotype homozygosity (XP-EHH) tests combining each test statistic into one composite CSS statistic for each SNP. F ST statistics were computed as the differentiation index between the target population/ s of interest (as selected) and the contrasting/reference population/s (as nonselected), and the XP-EHH and ΔSAF statistics were computed for the selected population/s against the reference population (as non-selected).
The CSS statistics were computed as follows: For each constituent method, test statistics were ranked (1, …, n) genome-wide on n SNPs. Ranks were converted to fractional ranks (r´) (between 0 and 1) by 1/ (n + 1) through n / (n + 1). Fractional ranks were converted to z-values as z = Φ−1(r´), where Φ−1(⋅) is the inverse normal cumulative distribution function. Mean z scores were calculated by averaging z-values across all constituent tests at each SNP position and P-values were directly obtained from the distribution of means from a normal N (0, m -1 ) distribution where m is the number of constituent test statistics. Log-transformed P values (-log 10 of P values of the mean z-values) were declared as CSS. To identify significant selection signals CSS scores were plotted against the genomic positions and the individual test statistics were then averaged across SNPs within 1 Mb sliding windows to reduce spurious signals (smoothed CSS score). Clusters of >5 SNPs among the top 1% SNPs were defined as signals of selection.
Integration of SNPs with skeletal muscle gene expression data. To integrate the SNP data arising from the CSS analyses with gene sets generated from functional genomics data analyses 24 we used an R software package, gwinteR 77 as follows: (1) a set of significant and non-significant SNPs (named the target SNP set) was collated across all genes in each gene set at increasing genomic intervals upstream and downstream from each gene inclusive of the coding sequence (e.g., ±10 kb, ±20 kb, ±30 kb… …±100 kb); (2) for each genomic region, a null distribution of 1000 SNP sets, each of which contains the same number of total significant and non-significant combined SNPs as the target SNP set, was generated by resampling with replacement from the search space of the total population of SNPs in the CSS SNP data set; (3) the nominal (uncorrected) CSS P values for the target SNP set and the null distribution SNP sets were converted to local FDRadjusted P values (P adj. ) using the fdrtool R package (version 1.2.15) 134 ; (4) to test the primary hypothesis for each observed genomic interval target SNP set a permuted P value (P perm. ) was generated based on the proportion of permuted random SNP sets where the same or a larger number of SNPs exhibiting significant q-values (e.g. q < 0.05 or q < 0.10) were observed; (5) a summary output file of all SNPs in the observed target SNP set with genomic locations and q-values was generated for subsequent investigation.
For the Racing versus non-Racing CSS data set, there were 36,768 autosomal SNPs with nominal P values that could be used for the integrative genomics analyses. For the integrative analyses of functional genomics outputs with the CSS data, two different subsets of differentially expressed genes (DEGs) in skeletal muscle were used: 1) trained rest (TR) and 2) untrained exercise (UE), representing the Thoroughbred skeletal muscle transcriptomic response to training (TR) and exercise (UE) 24 . The gene sets that were used were filtered with P adj. <10 −4 (TR) and <10 −12 (UE) to ensure manageable computational loads, resulting in 230 (TR) and 407 (UE) input genes on autosomal chromosomes for integration with the CSS SNPs. The input gene sets can be found in Supplementary Data 5 and Supplementary Data 6.
For the Mongolian Racing versus other breeds CSS data set, there were 30,003 autosomal SNPs with nominal P values that could be used for the integrative genomics analyses. The same DEG lists were used for the integrative analyses.
Whole genome resequencing sample collection, DNA extraction, and sequencing. Blood samples were collected from n = 70 individuals comprising horses indigenous to Mongolia and China and imported horse breeds. Genomic DNA was extracted following the standard phenol-chloroform extraction procedure. For genome sequencing, a total amount of 1.5 μg of genomic DNA from each sample was used to construct a library with an insert size of~350 bp. Paired-end sequencing libraries were constructed according to the manufacturer's instructions (Illumina Inc., San Diego, CA, USA). The DNA sample was fragmented by sonication to a size of 350 bp, then DNA fragments were end polished, A-tailed, and ligated with the full-length adapter for Illumina sequencing with further PCR amplification. PCR products were then purified, and libraries were assessed for size distribution using the Agilent 2100 Bioanalyzer and quantified using real-time PCR and sequenced on the Illumina HiSeq PE150 platform (Illumina Inc.).
Whole genome resequencing quality control. In total~97 Gb raw sequence data was generated for each sample with the quality of the bases 96% and 90.92% for ≥Q20 and ≥Q30, respectively. Approximately 652 Mb clean paired-end reads with 30× coverage per horse (24-36×) were retained after removing low quality reads (Supplementary Data 12). Reads were removed with ≥10% unidentified nucleotides (N); >10 nt aligned to the adaptor, allowing ≤10% mismatches; >50% bases having phred quality <5; and putative PCR duplicates generated in the library construction process, which mainly result from base-calling duplicates and adaptor contamination.
Whole genome resequencing reads mapping, SNP and INDEL calling. The remaining high quality paired-end reads were mapped to the horse EquCab 3.0 reference genome using BWA (Burrows-Wheeler Aligner) (Version: 0.7.8) with the command 'mem -t 4 -k 32 -M' 135 . In order to reduce mismatch generated by PCR amplification before sequencing, duplicated reads were removed using SAMtools. After alignment, SNPs were called on a population scale using a Bayesian approach as implemented in the package SAMtools 136 . Genotype likelihoods from reads for each animal at each genomic location, and the allele frequencies in the sample were then calculated using a Bayesian approach. The 'mpileup' command was used to identify SNPs with the parameters as '-q 1 -C 50 -t SP -t DP -m 2 -F 0.002'. Then, to exclude SNP calling errors caused by incorrect mapping or indels, only highquality SNPs (coverage depth ≥3 and ≤50, RMS mapping quality ≥20, maf ≥0.05, miss≤0.1) were retained for subsequent analysis.
SNP calling was also performed using GATK v4.1.2.0 137 . The process was as follows: (1)  Similar to SNP calling, the calling of indels was conducted using SAMtools with minimum depth ≥3 and GQ >20, and only indels <50 bp were retained.
Prioritisation of novel variants. The functional effects of the variants were predicted according to the Ensembl (version 101) annotation of EquCab3. Functional consequences of the variants (including SIFT scores 138 for missense variants) were predicted with the Ensembl Variant Effect Predictor (VEP, version 91.3 139 ).
These data formed the reference files that were screened for putative causative / functional variants in regions / genes of interest. Prioritisation of genes and variants was performed with a particular focus on (1) predicted effect of the variant, where 'high' effect was selected where possible, (2) biological function of the gene relevant to exercise, (3) minor allele frequency ≥0.1, and (4) high quality score (most had QUAL = 999) 140 . All genes were considered in the regions demarked by the CSS analyses, but generally only DEGs were considered in regions (±100 kb from gene start/end) demarked by the integrative analyses. Variants were only considered if they were in known genes. Predicted 'high' effect variants included the annotations splice_acceptor_variant&intron_variant, frameshift_variant, splice_acceptor_variant, and stop_gained, and predicted 'moderate' effect variants included the annotations missense_variant and inframe_deletion.
Matrix-associated laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) assay design for SNP genotyping. Assay development and genotyping was performed at Neogen Genomics (Lincoln, Nebraska, USA) using the MassARRAY platform and iPLEX GOLD chemistry according to the manufacturer's protocol (Agena Bioscience, San Diego, California, USA). Agena Design Suite software developed by the manufacturer was utilized to design multiplex assays. The variants of interest were among a larger set of 150 markers of interest (SNPs, MNPs) (designed for other analyses not included here) that were split into four separate sets of 48, 47, 35, and 20 markers each. The four multiplex assays were run on the genomic DNA provided for the animals, data generated, and quality check metrics applied.
Tests of genetic association. SNP genotype data for all samples were merged and converted to Plink .bed format. The data were analysed using PLINK 1.9 141 . SNPs with a call rate <80% and samples with a call rate <90% were excluded from the analysis. Case-control tests of genetic association were performed to compare the racing breeds horses n = 267 to non-Racing breeds horses n = 249 and to compare the Mongolian Racing horses n = 46 to non-Racing Chinese Mongolian breeds horses n = 121 (Supplementary Data 16). The Thoroughbred cohort was divided by geographic region, Europe, Australia, and North America. The quantitative trait, number of racecourse starts, and the binary trait, Elite vs non-Elite, were tested for each cohort with sex included as a co-variate (Supplementary Data 18).
Statistics and reproducibility. The composite selection signals (CSS) approach which combines the statistical outputs from the F ST , ΔSAF and XP-EHH tests into one composite statistic for each SNP was used for the identification of selection signals. Significance was assigned to SNPs within clusters of >5 SNPs among the top 1% of CSS scores. The reproducibility of the results is demonstrated by the identification of several selection signals in the Thoroughbred that were previously identified using different statistical methodologies (d i , H, H 12 , and Tajima's D) 5,7 . Overlaps with reported selection signals in other horse breeds are shown in Table 1.
The R package gwinteR was used for the integration of SNPs with DEGs. A set of significant and non-significant SNPs from the CSS analysis was collated across all genes in the DEG gene set at increasing genomic intervals (e.g., ±10 kb, ±20 kb, ±30 kb… …±100 kb) upstream and downstream from each gene. A null distribution of 1,000 SNP sets, each of which contains the same number of total significant and non-significant combined SNPs as the target SNP set, was generated for each genomic region by resampling with replacement from the search space of the total population of SNPs in the CSS data set. The nominal (uncorrected) CSS P values for the target SNP set and the null distribution SNP sets were converted to local FDR-adjusted P values (P adj. ) using the fdrtool R package (version 1.2.15) 134 . To test the primary hypothesis for each observed genomic interval target SNP set a permuted P value (P perm. ) was generated based on the proportion of permuted random SNP sets where the same or a larger number of SNPs exhibiting significant q-values (e.g. q < 0.05 or q < 0.10) were observed. A summary output file of all SNPs in the observed target SNP set with genomic locations and q-values was generated.
Tests of genetic association were performed by comparing allele frequencies in a chi-square (1df) test for the binary traits, and a Wald test for the quantitative trait. Sex of the horse was included as a covariate. P-values were adjusted for multiple testing by applying a Bonferroni correction. To demonstrate the reproducibility of the results, samples were independent of the discovery sample sets (CSS and WGS), and for the association with racing, racing breeds (Quarter Horse, French Trotter, Standardbred) that were not included in the discovery sample set were included in the racing cohort.
Reporting summary. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
SNP array derived genotypes generated in this study have been deposited in the European Variation Archive with the accession IDs PRJEB55561 (Project), ERZ12817059 (Mongolian horse analysis), and ERZ12817060 (Arabian horse analysis). The whole genome sequence data have been deposited in the Sequence Read Archive with the BioProject ID: PRJNA867509. The source data for Fig. 2 is available at Source data is available at https://doi.org/10.5061/dryad.g79cnp5sm. The SNP genotype data generated for the validation study are subject to the following licenses/restrictions: The phenotype and genotype data are the property of Plusvital Ltd. and subject to a confidentiality agreement with the animal owners.