A genome-wide scan for candidate lethal variants in Thoroughbred horses

Domestic animal populations are often characterised by high rates of inbreeding and low effective population sizes due to selective breeding practices. These practices can result in otherwise rare recessive deleterious alleles drifting to high frequencies, resulting in reduced fertility rates. This study aimed to identify potential recessive lethal haplotypes in the Thoroughbred horse breed, a closed population that has been selectively bred for racing performance. In this study, we identified a haplotype in the LY49B gene that shows strong evidence of being homozygous lethal, despite having high frequencies of heterozygotes in Thoroughbreds and other domestic horse breeds. Variant analysis of whole-genome sequence data identified two SNPs in the 3′UTR of the LY49B gene that may result in loss of function. Analysis of transcriptomic data from equine embryonic tissue revealed that LY49B is expressed in the trophoblast during placentation stage of development. These findings suggest that LY49B may have an essential, but as yet unknown function in the implantation stage of equine development. Further investigation of this region may allow for the development of a genetic test to improve fertility rates in horse populations. Identification of other lethal variants could assist in improving natural levels of fertility in horse populations.

The identification of high frequency lethal variants is of particular interest in domestic horse populations. Although a recent study has identified some candidate mutations 28 , to date there has been no published comprehensive characterisation of common embryonic lethal alleles in horse populations. Despite the large variety of domestic horse breeds found throughout the world, many breeds suffer from low within-breed diversity and small N e [29][30][31] . Some horse breeds with large census population sizes also experience low genetic diversity due to intense artificial selective breeding practices and closed population structures 30,31 . Maintaining good fertility rates is particularly important for horse populations due to the seasonal nature of breeding and the low individual fertility output, as mares produce only one foal from an eleven month gestation period 32 . Despite the extensive use of hormonal therapies to increase covering success in many domestic horse populations, per cycle pregnancy rates in some breeds only average around 65%, suggesting the presence of unknown variables that may reduce fertility 33 .
In this study, we aimed to characterise variants at high frequencies that may cause lethality in the Thoroughbred horse population. The Thoroughbred breed is of particular interest due to the closed population structure since the foundation of the studbook in the eighteenth century 34 . The population has since been intensely selected for the improvement of athletic abilities 35,36 , resulting in contemporary Thoroughbred horses being characterised by high levels of inbreeding and a small N e 31,[37][38][39] . Due to selective breeding practices, all Thoroughbred horses can trace their ancestry back to a small number of individuals from the foundation of the breed 37,38 . Genetic diversity in the Thoroughbred breed has been reduced in recent decades due to the increased commercialisation of popular stallions providing large genetic contributions to the population 40 . Although such practices are in line with selective breeding principles 41 , they could also inadvertently increase the frequency of embryonic lethal variants in the population. Reproductive technologies such as artificial insemination are banned in the Thoroughbred population, making the maintenance of high levels of natural fertility imperative. Additionally, Thoroughbred horses have been used as foundation stock for other popular horse breeds including The Quarter Horse, Standardbred, and many Warmblood breeds 30 . Therefore, identification of lethal variants in Thoroughbreds is also likely to assist in the breeding management of these populations. We also aimed to determine the frequency of any potentially lethal variants identified in the Thoroughbred population in other horse breeds and examine their transcriptomic profile in embryonic tissue.

Results
identifying candidate lethal Snps at high frequencies in thoroughbred horses. Analysis of genotype data from Thoroughbred horses (n = 156) identified only two adjacent, linked SNPs that significantly deviated from the Hardy-Weinberg equilibrium with an absence of homozygotes (Table 1). Under Hardy-Weinberg equilibrium, seven minor allele homozygotes were expected for both of these SNPs in the dataset. Genotype data from Japanese Thoroughbred horses (n = 370) also showed an absence of homozygotes for these SNPs (Table 1). In this dataset, the expected number of minor homozygotes for these SNPs under Hardy-Weinberg equilibrium was nine. Despite a complete absence of homozygotes, almost 35% of Thoroughbreds across both of these datasets were heterozygous for this two-SNP haplotype. These SNPs also showed an absence or reduction of minor homozygotes in genotype data for other domestic horse breeds (n = 2030, Table 1).
These two candidate SNPs mapped to the coordinates of 6:38278097 (rs68661802) and 6:38278874 (rs68663106), which are found in an intronic region of the LY49B gene on chromosome 6. This gene is part of the LY49 gene family, which plays an important role in innate immunity. There are five functional members of the LY49 gene family in Equus caballus, all of which closely grouped together on chromosome 6. Since both of the SNPs mapped to a non-coding region of the LY49B gene, the likelihood of either being a causal variant for lethality is low.  (Table S1).
identifying candidate causal variants using whole genome sequence data. To further investigate SNP frequencies in this region, variants were called from whole-genome sequence data of 90 domestic horses. The two SNPs identified in the preliminary analysis showed a complete absence of homozygotes for their minor alleles in these individuals (Table S3). Additionally, a number of variants closely linked to these SNPs were identified (Table S3, Figure S1). Annotation of these loci using SIFT 42 identified three variants that may result in changes to protein structure or expression, so these represent the most likely candidates to cause lethality in homozygous state (Fig. 1). The first of these variants, 6:38282610G > A (rs68663123), was located in an exonic region of the LY49B gene and resulted in an amino acid change from a phenylalanine to a serine residue. This substitution is located next to a tryptophan residue that appears to be highly conserved across members of the LY49 family and across species.  www.nature.com/scientificreports/ However, there is little conservation of the phenylalanine residue across taxa; some species have a phenylalanine and others a serine at this position. This SNP is annotated as being "tolerated" in SIFT. Two other variants that were closely linked to the candidate SNPs (6:38276742A > T and 6:38276955G > A) were found within the 3′UTR (3′untranslated region) of the LY49B gene. Alignment of the 3′UTR of the five functional LY49 genes in Equus caballus revealed that the region containing the SNP 6:38276955G > A (rs1139567427) is highly conserved in all members of the LY49 gene family ( Table 2). This region may be important for mRNA stability and translation into a functional protein. The other variant 6:38276742A > T (rs1137325172) was found in an AU-rich region at the end of the LY49B mRNA transcript, which is often associated with polyadenylation and post translation stability. transcriptomic analysis of RnA sequence data. Measurable levels of LY49B mRNA were not detected in equine trophectoderm tissue collected on day 16 of development. However, LY49B mRNA was observed in trophectoderm tissue collected on days 23 and 24 of development (Table 3). Additionally, LY49B mRNA transcripts were detected in microarray data from equine chorion and chorionic girdle tissue between days 27 and 34 of development (Table S4). Inner cell mass tissue collected on days 15, 22 and 25 of development did not show any measurable transcription of LY49B (Table 3). The genotypes of the candidate SNPs in the mRNA samples analysed were unknown.

Discussion
Analysis of genotype data identified a two-SNP haplotype as a strong candidate for harbouring a variant that causes lethality in homozygous state. The SNPs identified in this preliminary analysis mapped to an intronic region in the LY49B gene on ECA6 ( Table 1). The LY49B gene belongs to the LY49 (Killer cell lectin-like receptor subfamily A) family of receptors, which consists of five functional members in Equus caballus 43 . Other species (including humans) have a functionally similar, but structurally different gene family called KIR (Killer cell immunoglobin line receptors) 44 . The LY49/KIR gene family are expressed across various types of immune cells, and mediate their function through bindings to MHC-1 45 . The LY49B gene is expressed in myeloid cells where it regulates their activity through an inhibitory effect, possibly to prevent their spontaneous activation 46 . Despite the important role that they play in immunity, the function of LY49 genes in development is currently unknown. In humans, incompatibilities between foetal KIR and maternal MHC (HLA) genotypes are associated with an increased risk of miscarriage and preeclampsia [47][48][49] . Additionally, knockdown of LY49 in mice showed a high rate of implantation failure 50,51 . These findings indicate that LY49B may play an important role in maternal/foetal compatibility and implantation success in horses.
Analysis of transcriptomic data found that LY49B was first expressed in equine trophoblast tissue during the placental development stage. The first evidence of LY49B expression was found on day 23-24 of development (Table 3), during which the glycoprotein capsule surrounding the embryo is broken down and placental tissue starts to develop 52 . Measurable expression of LY49B was also found in chorion and chorionic girdle tissues between days 27 and 34 of development (Table S4). During this time, trophoblast cells rapidly proliferate to form the chorionic girdle, which then invades the endometrium to form epithelial cups 53 . It is possible that LY49B is important for successful implantation of the embryo by mediating the action of MHC-1 which is expressed during this time 54,55 . Further investigations into the role of LY49B in equine development would confirm whether impaired function causes lethality and the stage of development at which this occurs.
Variant calling in whole-genome sequence data from 90 domestic horses further confirmed an absence of minor homozygotes for the two SNPs of interest. Three variants closely linked to these SNPs were also identified in these data as the most likely candidates to cause loss of function in the LY49B gene and result in lethality in homozygous state (Table S3). One SNP was a missense variant in the coding region of the LY49B gene that results in the substitution of a negatively charged serine for an aromatic phenylalanine residue. However, lack of conservation of this SNP in LY49 genes across taxa makes it seem unlikely to be a causative variant for embryonic lethality. Two other variants found in the 3′UTR of the LY49B gene were also closely linked to the SNPs identified in the preliminary analysis, and seemed more likely candidates to cause embryonic lethality in homozygous state.
The 3′UTR of a gene is responsible for transcriptional stability through the binding of miRNAs and RNA binding proteins 56 . The addition of the polyadenylation tail to the 3′UTR is also essential to ensure proper processing and translation of the mRNA strand 57 . Mutations in the 3′UTR can lead to degradation of the mRNA, resulting in reduced or inhibited translation even when the gene is transcribed 58 . Variation in the 3′UTR of genes are associated with a number of diseases including Huntington's and breast cancer in humans 59,60 . Additionally, SNPs in the 3′UTR are associated with traits in livestock including milk production in cows, muscularity in sheep and obesity in horses 58,61,62 . Table 3. Gene counts from RNA sequence data of three trophectoderm and three inner cell mass tissue samples from equine embryos. Transcript counts are in fragments per kilobase/million (FPKM). www.nature.com/scientificreports/ Despite the importance of the 3′UTR for the mRNA stability and normal expression of a gene, little is known about how specific polymorphisms can affect post-transcriptional processing. This makes it difficult to identify how the 3′UTR variants identified in this study could affect the translation of LY49B mRNA into a functional protein. The 3′UTR variant 6:38276955G > A was identified as a possible candidate for embryonic lethality because it is highly conserved between all members of the LY49B family (Table 2) in horses, so may play an important role in mRNA stability. The other 3′UTR variant (6:38276742A > T) is found in an AU-rich region at the end of the transcript, so may be important for the addition of the polyadenylation tail. Further examination of the effects that these variants have on post-transcriptional processing would determine if they impact the normal expression of LY49B in horses.
Despite an absence of homozygotes, the two intronic SNPs identified in this study were found at high heterozygote frequencies in the Thoroughbred population. Mares are often covered multiple times in a season, which may explain why a more discernible reduction in fertility has not been observed as a result of the high frequency of this variant. However, the presence of lethal variants at high frequencies may result in more coverings being required for each mare in a season. Currently, there is no evidence that variation in the LY49B gene is associated with phenotypic advantages in horses. However, it is possible that one of the variants linked to these SNPs results in a phenotypic advantage in heterozygotes, which could explain why they have reached such high frequencies in the breed. It is also possible that selective breeding practices favouring a limited number of stallion bloodlines are responsible for this potentially lethal haplotype drifting to high frequencies in the Thoroughbred population. This would be most likely to occur if a stallion that made a large genetic contribution to the population was a carrier. A similar instance has recently been documented in cattle, where a lethal variant at a high frequency was traced back to a sire with an extensive genetic influence on the population 19 .
The presence of this potentially lethal haplotype across many diverse breeds of domestic horses indicates that it may not be the result of a recent mutation present only in the Thoroughbred population. Rather, heterozygotes for this haplotype may have been present in pre-domesticated horses as a rare variant, and have become more frequent in some domestic breeds as the result of population bottlenecks due to breed formations and selective breeding practices. Domestication and breed formation events have been well documented to result in increased deleterious mutation loads in horses and other domestic species 22,31,63,64 . A high proportion of heterozygotes for this haplotype were found in some breeds closely related to the Thoroughbred including the Paint, French Trotter, Morgan and Quarter Horse. Notably, over 70% of Quarter Horse samples included in this study were heterozygous for these SNPs ( Table 1). The Quarter Horse has an open stud book, and higher genetic diversity than the Thoroughbred population 65 , making the high frequency of a potentially lethal haplotype at first surprising. The Quarter Horse dataset reportedly did not contain full or half siblings 65 , but the collection of samples from one geographical area may not fully reflect the diversity of the worldwide population. An average relatedness analysis of these samples noted the large genetic influence of one particular Thoroughbred stallion 65 , which may explain the high frequency of heterozygotes observed in this population. However, the extremely high frequency of heterozygotes in this breed may be due to selective breeding favouring these individuals.
The Belgian Draft, Mangalarga Paulista and Tuva breeds also show a high proportion of heterozygotes, but are more distantly related to the Thoroughbred and to each other. Therefore, the high frequency of heterozygotes in these breeds may be due to independent genetic drift events. Heterozygotes for this haplotype were notably absent from one branch of the tree containing small heavy horses from Northern Europe, which are more distantly related to the Thoroughbred. A larger dataset of Exmoor Pony samples from this phylogenetic branch revealed one heterozygote for this haplotype (Table 1). This could be due to a calling error, but it is also possible that these SNPs exist at very low frequencies in these breeds. The small sample size of the genotype data for many individual breeds in this study means that heterozygote frequencies across all subpopulations found throughout the world may deviate from that reported here. However, these data provide an indication of breeds with high proportions of heterozygotes for this region. Analysis of SNP data from Northern-Swedish Coldblooded Trotters identified 22 homozygotes for the SNP at position 6:38278874. It is likely that there has been recombination between this SNP and the causal variant, and may appear more frequent in this population due to differences in breed history and recombination patterns 66 . However, additional analyses are required to explore this further. Overall, our findings suggest that this region shows evidence harbouring a homozygous lethal variant, yet a high proportion of heterozygotes are found across many domestic horse breeds.
In this study, we identified a haplotype at high heterozygote frequencies in the Thoroughbred horse population that is a strong candidate for harbouring a variant causing lethality in homozygous state. Similar analyses on larger datasets in other livestock populations have identified multiple lethal haplotypes, so it is likely that other such variants are present at high frequencies in the Thoroughbred population but were not captured in this study. Additionally, the use of commercial SNP arrays only allows for the identification of variants with high minor allele frequencies in populations. Analysis of larger sample sizes, and using higher density genotype data could allow for identification of other variants associated with lethality in domestic horses. The identification of this potentially lethal haplotype demonstrates the potential implications of heavily favouring a limited number of bloodlines in selective breeding practices. Further characterisation of lethal haplotypes in other breeds would also assist in breeding management to increase per covering fertility rates in domestic horse populations. initial genotyping. Genotype data from a representative sample of Thoroughbreds were used to identify SNPs with a high proportion of heterozygotes, but an absence of homozygotes for one allele. Genome-wide SNP data were generated for 156 Australian Thoroughbred horses by genotyping samples on either the Illumina 70 K Chip (65,102 SNPs) (n = 102) or the Affymetrix 670 K Chip (670,796 SNPs) (n = 54). Common genotyped SNPs between the two arrays were scanned for deviations from the Hardy-Weinberg equilibrium with an absence of homozygotes for one allele using PLINK (version 1.9) 67 . The p values were adjusted using a false discovery rate correction with the R package "qvalue" 68 . Since SNPs with an absence of homozygotes could indicate a calling error, the search was narrowed to only include adjacent SNPs that fit such criteria. The frequencies of the candidate SNPs were then examined in publicly available genotype data from Japanese Thoroughbreds (n = 370) typed on the Affymetrix 670 K Chip 69 and these were added the Thoroughbred sample. The SNP frequencies were then characterised from genotype data from Swedish Warmbloods (n = 380) 70 and Norwegian-Swedish Coldblooded Trotters (n = 646) 71 typed on the Affymetrix 670 K Chip. Publicly available data from Exmoor Ponies (n = 285, typed on the Affymetrix 670 K Chip) 71 , Quarter Horses (n = 137, typed on the Illumina 70 K Chip) 72 and horses of 32 different domestic breeds (n = 582, typed on the Illumina 50 K Chip) 29,30 were also included in this preliminary scan for SNP frequencies. In these data, raw intensities were plotted to check for calling errors. If potential calling errors were detected, SNPs were recalled using a mixture model fitted with an expectation-maximization algorithm in R.

Methods
Variant discovery and mapping. Publicly available whole-genome sequence data were used to further examine the frequencies of the candidate SNPs identified in the initial genotype analysis, and to identify linked variants. Paired end whole-genome sequence data from 90 horses of different domestic breeds were used in this analysis (Table S5). The whole genome datasets were downloaded from the European Nucleotide Archives (ENA, https ://www.ebi.ac.uk/ena) which included horses of different domestic breeds (PRJEB14779, n = 70) and additional Thoroughbred samples (PRJNA168142, n = 16 and PRJNA184688, n = 4) ( Table S5).
The SNP array used in the initial genotyping analysis was developed based on coordinates of the EquCab2.0 reference genome. For consistency, we used the EquCab2.0 assembly as a reference for the whole-genome sequence analysis. The EquCab2.0 assembly was also used because of an issue with resolution in the area of interest on the newer EquCab3.0 assembly. The raw reads were mapped to the EquCab2.0 reference genome using BWA-MEM algorithm from Burrows-Wheeler Alignment Tool (version 0.7.17) 73 . Duplicate reads were flagged using Samblaster (version 0.1.22) 74 , and base recalibration was performed using Genome Analysis Toolkit (GATK) (version 4.0.8.1) 75 . Variants (SNPs and INDELs (insertions and deletions)) were called using Haplotype Caller and then filtered using the standard hard filtering recommendations in GATK 76 . The individual SNPs were then filtered to only include high quality allele calls with an average filtered depth over 10 and a Phred score over 20.
Variants that were linked to the SNPs identified from the genotype data were produced using the LD function in PLINK (version 1.9) with a window size of 5 Mb 67 . Only SNPs with an r 2 value of over 0.8 and a D′ value > 0.9 were shortlisted. The effects of each SNP on gene structure and function was characterised using SIFT (version 4G) 42 . The conservation of variants across taxa was analysed using the NCBI Conserved Domain Database Search 77 . transcriptomic analysis. Publicly available RNA sequence data were used to examine expression levels of the genes of interest in embryonic tissue. The data included equine inner cell mass tissue (collected at day 15, 22 and 25, n = 3) and trophectoderm tissue (collected at day 16, 23 and 24, n = 3) from the Functional Annotation of ANimal Genomes (FAANG) equine biobank (available from ENA under the project name PRJNA223157) 78 . Adaptors were trimmed using bbduk from BBtools (version 37.98) 79 . Reads were aligned to the EquCab 2.0 genome using STAR (version 2.7.2b) 80 . Counts were generated using featurecounts from Subread package(version 1.5.1) 81 , then quantified in fragments per kilobase/million (FPKM) using the R package "edgeR" 82 with the Equus_caballus_Ensembl_94 file used for annotation. Microarray data for chorion (n = 19) and chorionic girdle (n = 19) tissue collected from horse embryos between days 27-34 of development 83 were also examined for gene expression levels. ethics statement. Hair samples from Australian Thoroughbred horses were collected under approval from University of Sydney Ethics Committee (Number: N00-2009-3-5109.) Written informed consent to use the animals in this study was obtained from the owners of the animals. The hair samples from Swedish Warmblood horses were originally collected for parentage testing and stored in the biobank at the Animal Genetics Laboratory, SLU so ethics approval was not applicable. Hair and blood samples of Norwegian-Swedish Coldblooded Trotters were collected under approval from the Ethics Committee for Animal Experiments in Uppsala, Sweden (Number: C 121/14). All the methods were performed in accordance with the guidelines set out by the respective Animal Ethics Committees and the guidelines contained in the Guide for the Care and Use of Laboratory Animals. No experimental procedure was performed on live animals. All other data was downloaded from publicly available repositories.

Data availability
The whole-genome sequence data used in this study is publicly available for The European Nucleotide Archive. Genotype data for Japanese Thoroughbreds The exceptions are genotype data from Australian Thoroughbred, Swedish Warmbloods and North Swedish Coldblooded Trotters, which are available on request but restrictions apply to the availability of these data which were used under license for the current study and so are not publicly available.