Rare and common variant discovery by whole-genome sequencing of 101 Thoroughbred racehorses

The Thoroughbred breed was formed by crossing Oriental horse breeds and British native horses and is currently used in horseracing worldwide. In this study, we constructed a single-nucleotide variant (SNV) database using data from 101 Thoroughbred racehorses. Whole genome sequencing (WGS) revealed 11,570,312 and 602,756 SNVs in autosomal (1–31) and X chromosomes, respectively, yielding a total of 12,173,068 SNVs. About 6.9% of identified SNVs were rare variants observed only in one allele in 101 horses. The number of SNVs detected in individual horses ranged from 4.8 to 5.3 million. Individual horses had a maximum of 25,554 rare variants; several of these were functional variants, such as non-synonymous substitutions, start-gained, start-lost, stop-gained, and stop-lost variants. Therefore, these rare variants may affect differences in traits and phenotypes among individuals. When observing the distribution of rare variants among horses, one breeding stallion had a smaller number of rare variants compared to other horses, suggesting that the frequency of rare variants in the Japanese Thoroughbred population increases through breeding. In addition, our variant database may provide useful basic information for industrial applications, such as the detection of genetically modified racehorses in gene-doping control and pedigree-registration of racehorses using SNVs as markers.

SNV density by genomic functional region. The SNV density (numbers of detected variants in each chromosomes whose size multiplied by scale factor 1000) was examined in genomic regions with different functions (intergenic, upstream and downstream regions, exon, intron, and untranslated region [UTR]) (Supplementary Table S2). Intergenic regions showed the highest density of SNVs ( Both intergenic and intron regions are thought to have a high mutation rate because of the absence of gene coding regions. The variant rate in the intron was approximately half that in the intergenic region. This may be because some parts of introns are related to steps involved in the protein synthesis process, such as splicing, despite being non-coding for amino acids. Interestingly, the sequence of the UTR was more conserved than that of the exon region. One possible reason is that the UTR is involved in regulating expression by non-coding RNAs and RNA-binding proteins 24,25 . No major differences were observed between the numbers of synonymous and non-synonymous substitutions detected in the gene-coding region (Fig. 2). However, the frequency of non-synonymous substitutions tended to be higher than that of synonymous substitutions in chromosomes 12 and 20, suggesting positive selection for higher variation in the amino acid sequence of proteins, such as those of the major histocompatibility complex 26,27 .  Fig. 3 shows the number of SNVs that had 202 ALT-alleles among the 101 horses. Interestingly, a blip was observed at the position of 101 REF-alleles, which corresponds to a minor allele frequency of 0.5. The horse reference genome sequence was produced using a Thoroughbred mare (Twilight) 18 , whereas genotypes of 3475 SNVs detected in the present study were homozygous for the alternative (ALT)-allele in the 101 horses (left side of Fig. 3), suggesting that Twilight may have unique SNVs that were not observed in our Thoroughbred population.
As can be seen on the right side of Fig. 3, 802,454 SNVs had only one ALT-allele among the 101 horses (total 202 alleles). Although depending on the definition of rare variants and/or individual numbers analysed, approximately 6.9% of the SNVs detected were rare variants in the Thoroughbred population. As we conducted WGS without PCR amplification, any bias in variant detection was expected to be low.
A similar trend was observed in variant detection using 88 horses from 25 breeds in horses 21 , and 12.5% of detected variants were specific to individual horses. A high frequency of rare variants was also observed in SNVs obtained from the 1000 Genomes Project in humans 28,29 . Interestingly, low-read depth WGS from 3,781 individuals of British ancestry identified over 42 million SNVs, approximately 80% of which were rare variants 30 . Therefore, by increasing the number of Thoroughbreds used for WGS, further rare variants may be identified.
Application of common variants. Many SNVs were identified as common variants in this study. These SNVs may be useful as markers for Thoroughbred registration. Although short tandem repeats have been currently used internationally in parentage testing of Thoroughbreds, a test using SNVs has been considered by the International Society for Animal Genetics with 50 SNVs that we previously developed as candidates 31 . When searching the variant database for these candidates, polymorphic information can be easily extracted for all SNVs excluding one located at position 20379456 on ECA9 (Supplementary Table S3). Because reads mapped to the genome region containing the SNV on ECA9 showed low-mapping quality scores based on multiple-region mapping of reads, variant calling by GATK excluded this region. Therefore, SNVs on genomic regions with low-quality scores should not be used as markers for Thoroughbred registration. Additional SNVs for parentage verification in Thoroughbreds can be identified in our variant data. Blue: A to T or T to A; orange: A to G (T to C) or G to A (C to T); grey: A to C (T to G) or C to A (G to T); yellow: G to C or C to G.  Table S4). Interestingly, 13,402 of these SNVs were in the pericentromeric region (position: 1-2,293,071, approximately 2.3 Mb) of ECA29 ( Supplementary Fig. S1), whereas the other SNVs were widely distributed in short lengths (several hundred base pairs). Analyses of individual mapping data using the Integrative Genomics Viewer and/or read depth values in VCF-files showed that the read depth (sequencing coverage) of the pericentromeric region was approximately twice that of the adjacent regions, and SNVs were distributed more densely in the pericentromeric region. The pericentromeric region of ECA29 may include duplicated sequences detected as pseudo-SNVs. When analysing many individuals to confirm the genotype frequency, duplicated regions in the reference genome may be identified even in short read sequence data (150 bp). However, in the present study, we did not exclude SNVs with all-heterozygous genotypes from the total number of SNVs detected because we could not determine absolutely if they are duplication regions using only data from this study.

Rare variants.
In this study, we defined SNVs with only one ALT-allele detected among 101 horses as rare variants. When classified by birth year, the number of rare variants for earlier-born Thoroughbred individuals was lower than that for later-born individuals (Fig. 4), whereas horses born in later years tended to have higher numbers of rare variants. Horses born in 1985,1990,1993, and 1996 had 3994, 6713, 6279, and 4397 rare variants, respectively. Interestingly, these horses were used as stallions in Japan and had produced over 200 offspring. Particularly, the horse born in 1985 had over 500 offspring, several of which have become stallions and produced many progenies in Japan. In this case, it was considered that variants of the horse born in 1985 were inherited and increased in frequency within Japanese Thoroughbred horse populations, resulting in fewer rare variants.
Rare variants in individuals were characterised (Supplementary Table S5) and found to be present in coding region of genes as non-synonymous substitutions, start-gained, start-lost, stop-gained, and stop-lost, suggesting that the variants affect the phenotypes of each Thoroughbred horse as major and/or minor effects. Therefore, these functional rare variants in individual horses may explain the missing heritability based on the common disease/rare variant hypothesis 32 .
A variant identified in the horse myostatin gene was strongly associated with the optimal flat-racing distance in Thoroughbreds 4,5 . This variant may have been less frequent in the early Thoroughbred population 33 and its frequency may have increased in current Thoroughbred populations by selective breeding through environmental adaptations (race distance), as short-distance races (1000 and 1200 m) have become more common over time.  11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30  www.nature.com/scientificreports/ Therefore, some rare variants identified here may also contribute to racehorse performance and/or any traits in future populations by selective breeding through environmental adaptations.
Allele frequencies of functional genes. Constructing a variant database in a Thoroughbred population facilitated easier identification of variants of annotated genes and their frequencies in the population. In horseracing, the generation and use of genetically modified horses are prohibited by the IFHA and International Stud Book Committee (ISBC). In our previous studies, we targeted 12 genes for gene-doping control in horseracing 13,15 . In the Thoroughbred population, we detected non-synonymous SNVs of some of the 12 genes in our variant database: 1 in GH1, 3 in PCK1, 1 in VEGF, and 2 in ZFAT ( Table 2). This information may be useful for clarifying whether variants detected in doping tests are naturally or artificially introduced. The drug-metabolising enzymes CYP1A1 and CYP1A2 also had one and four non-synonymous SNVs, respectively. In humans, CYP1A2 metabolises many low-molecular-weight compounds including caffeine and lidocaine 34 , which are banned as doping substances. Although the substrates in horse CYP1A2 are not completely clear 35 , differences in drug metabolism ability among individuals may be affected by SNVs with non-synonymous substitutions.
GWAS have been performed to identify SNVs associated with target traits and identified causative loci in Thoroughbreds [3][4][5] . It may be difficult to identify causative variants by only GWAS because common variants are used; however, the variant information of annotated genes constructed in this study can help in this identification through genotype imputation 36 . Diversity of the mitochondrial genome. By mapping to the reference mitochondrial genome (16,660 bp) 37 , 335 SNVs were detected in 101 Thoroughbred horses (Fig. 5a). Seven of these SNVs occurred only as ALT-alleles and 58 of these SNVs were non-synonymous mutations among the 101 Thoroughbred horses (Fig. 5b).
Haplotypes of the mitochondrial genome were identified among the 101 horses and grouped by their founder mares based on Stud Book pedigree information. Interestingly, even if horses had the same founder mare in their female lines, the identified haplotypes clearly differed in several individuals. These discrepancies may result from errors in the original pedigree registration, and similar mistakes have been reported based on variations of mitochondrial genome 38 . In contrast, although the overall haplotype structures were consistent in the same founder mare, horses with a small number of mutations were also observed 39 . This may be because of natural mutations in their evolution over 300 years (approximately 30 generations) or miscalling of the identified variants.   www.nature.com/scientificreports/ Concluding remarks. In this study, we constructed a genome-wide variant database from 101 Thoroughbred racehorses who did not have sibling or parent-child relationships. The number of detected SNVs may be overestimated because of miscalling from several duplicated regions, but our results revealed approximately 12 million SNVs among Thoroughbreds and that around 6.9% of these are rare variants. A limitation of our dataset is the fact that current algorithms for mapping and variant calling will detect fake SNVs in duplicated regions in addition to genuine SNVs. Consequently, it is possible that our data could include up to tens of thousands of false-positive variant calls. Therefore, the improvement of algorithms for mapping and variant calling should be a future research priority.   www.nature.com/scientificreports/ The detected rare variants included many functional variants, such as non-synonymous variants, suggesting that rare functional variants affecting protein functions reflect individual phenotypes. In humans, rare variants are known to play a key role in many complex diseases, and rare variants in horses may also play a key role in Thoroughbred disease and/or racing performance.
The Equine Genetics and Thoroughbred Parentage Testing Standardization Committee of International Society for Animal Genetics is investigating the possibility of moving from short tandem repeats to SNVs as markers for determining parentage in Thoroughbreds. In the present study, we identified many SNVs as common variants and found duplicated regions in the horse genome and multiple mapped regions of sequence reads. Therefore, candidate SNVs for parentage verification should be selected from the common variants and exclude SNVs detected at these duplicated regions. The variant database in the present study can be used to confirm this.
Currently, the generation and use of genetically modified racehorses is banned by the ISBC and IFHA in horseracing. In the present study, we targeted only Thoroughbreds and determined the extent of Thoroughbred genomic diversity among the population of racehorses. Our findings will be useful as baseline information for gene-doping tests that use whole-genome and targeted resequencing.

Materials and methods
Animal ethics. All the experimental protocols were approved by the Animal Care Committee of the Labora-  Briefly, using QCleaner, reads with a low-quality base (< 20 Phred score) were removed. The remaining reads were filtered and removed if 80% of their nucleotides had a quality value < 20, had sequences of over five unknown nucleotides, had only < 32 bp length sequences, or did not have mate-pairs. Only high-quality sequences were selected.
The selected reads were aligned to the horse reference genome sequence EquCab3.0 assembly from GenBank (GCA_002863925.1) using Burrows-Wheeler aligner with default parameters. Alignments were converted from sequence alignment/map format to sorted, indexed binary alignment/map files (SAMtools, version 1.8), and then the Picard tool was used to remove duplicate reads. Finally, binary alignment/map files were constructed with mapping data to the horse reference genome.
Database construction and statistical analyses. Information from VCF files for individual horses, chromosome, position, reference allele, alternative allele, gene, HGVSp, annotation, and annotation impact was collected into a single file using the Vcf2sql (Amelieff Co.), from which allele frequencies and SNP density (numbers of detected variants in each chromosomes whose size multiplied by scale factor 1000) were calculated.

Data availability
Variant data in 101 Thoroughbred can be accessed through the Open Science Framework, https:// doi. org/ 10. 17605/ OSF. IO/ PVNCY. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.