The horse Y chromosome as an informative marker for tracing sire lines

Analysis of the Y chromosome is the best-established way to reconstruct paternal family history in humans. Here, we applied fine-scaled Y-chromosomal haplotyping in horses with biallelic markers and demonstrate the potential of our approach to address the ancestry of sire lines. We de novo assembled a draft reference of the male-specific region of the Y chromosome from Illumina short reads and then screened 5.8 million basepairs for variants in 130 specimens from intensively selected and rural breeds and nine Przewalski’s horses. Among domestic horses we confirmed the predominance of a young’crown haplogroup’ in Central European and North American breeds. Within the crown, we distinguished 58 haplotypes based on 211 variants, forming three major haplogroups. In addition to two previously characterised haplogroups, one observed in Arabian/Coldblooded and the other in Turkoman/Thoroughbred horses, we uncovered a third haplogroup containing Iberian lines and a North African Barb Horse. In a genealogical showcase, we distinguished the patrilines of the three English Thoroughbred founder stallions and resolved a historic controversy over the parentage of the horse ‘Galopin’, born in 1872. We observed two nearly instantaneous radiations in the history of Central and Northern European Y-chromosomal lineages that both occurred after domestication 5,500 years ago.


Supplementary Figure S3
LipY764 variant calls after filtering -example when generating a combined vcf for all 139 samples using CombineGVCFs The first bar shows the number of total calls on LipY764. The number of variants is almost halved when filtering for variants in scY regions only. A large fraction was excluded when the filter for reference errors, phased calls and calls with multiple alternatives was applied. The fourth bar shows the number of variants left when excluding variants with a genotype quality of less than ten and variants that were not covered by more than two reads in at least one individual.

Supplementary Figure S7
Phylogenetic tree resulting from dating with BEAST. This tree is based on 1,856 SNVs (see "variant used for dating" in Supplementary Table S8) found in the 63 samples with a mean scY coverage of at least six (see "scY mean coverage" in Supplementary Table S1). The X axis shows node ages as million years ago (mya). Node bars correspond to the 95% highest posterior density intervals for estimated node dates. The topology corresponds to that inferred with MP and ML methods ( Fig. 2 and Supplementary Fig. S5). We named the four most important  blastn -query NONREPMSY -db MONGOLIANHORSE -outfmt "6 qseqid sseqid length pident nident qstart qend sstart send evalue bitscore" -perc_identity 95 -out OUTFILE Python v3.5.1 was then used to extract BLAST hits with a minimum length of 150 bp and a minimum identity of 95% to the nonrepMSY. These contigs, together with other published horse Y-chromosomal sequencesthe 1.6 Mbp nonrepMSY 2 , six BAC-clones 5 , horse Y-chromosomal gene fragments 6 and equine Y-chromosomal and XY-homologous GenBank entrieswere merged into a file with multiple FASTA-formatted sequences (bait sequence details listed in Supplementary Table S2). This file was used as reference (bait) to extract Y-specific reads by mapping each Lipizzan male with bwa aln v0.7.15. We removed duplicates and extracted only mapped reads with a minimum mapping quality of 20 and converted them to fasta format with samtools v1.4. To correct for assembly errors, trimmed whole-genome sequence reads from the three Lipizzans were remapped to the raw assembly using the settings described above. To finally run REAPR v1.0.18 the three files were merged. samtools merge all_Lip_to_Lip_Y_Assembly.bam 111_mapped_sort_rmdup_MQ20.bam 113__mapped_sort_rmdup_MQ20.bam 169_mapped_sort_rmdup_MQ20.bam reapr pipeline Lip_Y_raw_Assembly.fa all_Lip_to_Lip_Y_Assembly.bam

Mapping
We mapped ten males and five females to the raw assembly and the equine X chromosome with bwa aln v0.7.15, removed duplicates and filtered for mapped reads using samtools v1.4.

Windows
The Y assembly and selected regions of the PAR of the X chromosome (PARcoordinates were chosen according to Raudsepp and Chowdhary 7

Calculation of female background coverage in confirmed Y regions
The background coverage b of Y regions was inferred as the female's mean coverage in windows confirmed to be single-copy Y. To predict assured singlecopy MSY regions, confirmed Y-specific contigs from Wallner et al. 2

Probabilistic model -derivation of the formula
We first obtained the mean coverage of each male and female horse among all windows that could a priori be assigned to the PAR, which we represent with the symbols c i for the ith male horse, with 1 ≤ i ≤ I, and c j for the jth female horse, with 1 ≤ j ≤ J. Next, we obtained the mean background coverage of female horses among all windows that could a priori be assigned to single-copy Yspecific regions, which we represent with b j for the jth female horse. Now we assigned the windows of the REAPR-corrected assembly to either MSY or nonMSY according to a likelihood ratio criterion. Assuming that the kth window is within a nonMSY region, and assuming that the coverages y i and y j for the male and female horses, respectively, are Poisson distributed, we obtain the following likelihood given the window-specific coverage µ k : The maximum likelihood ratio estimator of the parameter µ k is: Assuming that the kth window is within the MSY, we obtain the following likelihood: The maximum likelihood estimator for the window-specific coverage of the male horses, which we now denote with v k (to differentiate from the estimator for both male and female horses m k ), is: The coverages for the female horses are assumed to correspond to the background frequencies b j . The likelihood ratio is obtained by substituting the maximum likelihood estimators into the likelihoods:

Probabilistic model -calculation
The LipY764 mean coverages per window and sample (table "malfem") and the mean PAR window coverages for males (vector "mal_par") and females (vector "fem_par"), to normalise the former, are provided as input for the R script. Also, the female background coverages (vector "constants") are provided. A relative coverage cut-off value less than one was selected to distinguish scY from mcY windows (see Supplementary Fig. S2).

LipY764 gene content and homologies to nonrepMSY and eMSYv3
BLAST v. We set the thresholds for a perfect match to nident > 300 and pident > 95%.

Variant calling and generation of the horse Y phylogeny
Adaptor free and trimmed data (Supplementary Table S1) were mapped to the raw assembly after classification using bwa aln v0.

Test for equality of numbers of mutations on branches
Equality of numbers of mutations on pairs of branches originating from the same node was checked using a chi-square test; p-values were determined by 10,000 simulations in R v3.2.3.

Mutation rate estimate
For mutation rate calculation we assume: