A PLAG1 mutation contributed to stature recovery in modern cattle

The recent evolution of cattle is marked by fluctuations in body size. Height in the Bos taurus lineage was reduced by a factor of ~1.5 from the Neolithic to the Middle Ages, and increased again only during the Early Modern Ages. Using haplotype analysis, we found evidence that the bovine PLAG1 mutation (Q) with major effects on body size, weight and reproduction is a >1,000 years old derived allele that increased rapidly in frequency in Northwestern European B. taurus between the 16th and 18th centuries. Towards the 19th and 20th centuries, Q was introgressed into non-European B. taurus and Bos indicus breeds. These data implicate a major role of Q in recent changes in body size in modern cattle, and represent one of the first examples of a genomic sweep in livestock that was driven by selection on a complex trait.


Japanese cattle
The HapMap data are summarized in Supplementary Table 1. The Illumina® BovineHD panel included 1,224 Y-linked markers. Ideally, genotypes at these markers should be recalled manually since both the standard cluster file and the GenTrain2 algorithm in GenomeStudio assume diploidy. However, as this process is extremely laborious, we opted to analyze only markers with obvious two-states hemizygous genotypes in the HapMap data. First, we retained only markers with GenTrain Score greater than 0.80. This threshold guaranteed that clusters AA, AB and BB were well-shaped and clearly separated. Second, animals with call rate lower than 95% at Y-linked markers were excluded, which mainly caused females to be removed from downstream analyses. Next, all remaining heterozygous calls were set to missing and homozygous genotypes were replaced by hemizygous states (0 for AA and 1 for BB). Lastly, only markers with call rate equal to 100% and minor allele frequency greater than 5% in the overall breed data were preserved. This filtering procedure resulted in 29 markers and 563 individuals, which yielded five different haplotypes (YHapG1 -YHapG5, Supplementary Table 2).
YHapG1 and YHapG2 were identified as B. indicus haplotypes, whereas YHapG3, YHapG4 and YHapG5 were recognized as B. taurus haplotypes.
Additionally, YHapG3 was nearly fixed in six out of the nine Northwestern European breeds that were highly selected for Q. Importantly, similar to crossbred and admixed populations, Wagyu presented multiple Y chromosome haplotypes, including YHapG3, which is consistent with admixture and supports the hypothesis of introgression of Q from Northwestern European cattle into Japanese breeds. First, in order to confirm whether the polled allele in Nellore was indeed of B.
At the P C and P F regions, the significant haplotypes were TGAAAG and TGTAGG, respectively. Both were the major haplotypes in most of the European breeds in the HapMap data, except for Romagnola (see Supplementary Tables 3 and 4). These haplotypes were also rare or absent in African B. taurus, B. indicus and outgroup species. This confirms that polledness in Nellore is a B. taurus contribution.
Frequency of the PLAG1 haplotype GGGTTC was highly correlated with frequency of TGAAAG or TGTAGG (r = 0.787, p = 2.53 x 10 -7 ) in worldwide cattle (Supplementary We evaluated whether the 24 sequenced Nellore bulls could serve as a reference set for imputation of HD genotypes up to sequence variants. The main idea was that, if accurate predictions of whole genome sequence genotypes could be achieved using these bulls, the haplotype association analysis presented in the main paper could be replaced by direct analysis of sequence variants. Since test animals had only HD data available, extraction of HD genotypes was performed on the 24 sequenced bulls, and test animals had their genotypes reduced to panels of smaller marker density that partially overlapped with the HD array (Supplementary Table 6).
These panels included the industry's standard Illumina® BovineSNP50 v2 (50K) and two B. indicus-specific panels, namely GeneSeek® Genomic Profiler Bos Indicus HD (GGP75Ki) and Illumina® Z-Chip v1 (ZChip). FImpute v2.2 4 was used to impute genotypes, as it has been shown to yield comparable accuracies with competing algorithms with substantially smaller run times 5,6 . In comparison to Beagle 7 , FImpute was found to yield higher accuracies in a wide range of imputation scenarios in this Nellore population 8 . Briefly, FImpute's method is based on a deterministic approach that assumes all animals are related to each other to some degree. Initially, genotypes are predicted using chromosome windows with shared long haplotypes (identical-by-descent segments due to recent ancestors). Then, genotypes are further imputed using shorter windows to capture information from more distant relatives.
Accuracy of imputation was measured as the percentage of correctly imputed genotypes. Average accuracies were 81.50 ± 4.21%, 89.10 ± 2.73% and 82.41 ± 4.12% for 50K, GGP75Ki and ZChip, respectively. These results indicated that a larger sample of reference animals would be required for accurate imputation. We re-analyzed the birth weight data in order to assess how population stratification and haplotype size could affect the association results. We first used the same method described in the material and methods with haplotypes of size 1 (equivalent to performing a standard single-marker analysis), 5, 10 and 20, paralleling our original analysis of 6-markers haplotypes. Then, we repeated all analyses considering the following mixed linear model:

Supplementary
where y is the vector of dEBVs, 1 is a vector of 1s,  is the intercept, u is the vector of polygenic effects (i.e., sum of the random effects of genome-wide markers), and e is the vector of residual effects. The model assumed y ~ N(1, V) for V = K u 2 + W e 2 , where K is the genomic relationship matrix (GRM) excluding the tested chromosome, W is the dEBV weight matrix as described in the material and methods, and  u 2 and  e 2 are the variance components related to polygenic and residual effects, respectively. Variance components were estimated using restricted maximum likelihood (REML) with the Newton-Raphson algorithm. Similarly to the original analysis, we computed  k = (z k V -1 z k ) -1 z k V -1 (y -1) and SE( k ) = (z k V -1 z k ) -1/2 for each haplotype k, and the association mapping was based on the two-tailed t-test  k / SE( k ). This approach was very similar to the standard Leave-One-Chromosome-Out Mixed Linear Model Association (MLMA-LOCO) method 9 , except that our analysis considered haplotypes instead of single-markers, and that heterogeneity of residual variance was explicitly modeled.
We found that association results were robust in respect to variations in haplotype size and presence of population structure (Supplementary Fig. 2).
Considering no correction for genetic stratification, the most significant associations