# Effect direction meta-analysis of GWAS identifies extreme, prevalent and shared pleiotropy in a large mammal

## Abstract

In genome-wide association studies (GWAS), variants showing consistent effect directions across populations are considered as true discoveries. We model this information in an Effect Direction MEta-analysis (EDME) to quantify pleiotropy using GWAS of 34 Cholesky-decorrelated traits in 44,000+ cattle with sequence variants. The effect-direction agreement between independent bull and cow datasets was used to quantify the false discovery rate by effect direction (FDRed) and the number of affected traits for prioritised variants. Variants with multi-trait p < 1e–6 affected 122 traits with an average of 10 traits. EDME assigns pleiotropic variants to each trait which informs the biology behind complex traits. New pleiotropic loci are identified, including signals from the cattle FTO locus mirroring its bystander effects on human obesity. When validated in the 1000-Bull Genome database, the prioritized pleiotropic variants consistently predicted expected phenotypic differences between dairy and beef cattle. EDME provides robust approaches to control GWAS FDR and quantify pleiotropy.

## Introduction

Genome-wide association studies (GWAS) find mutations associated with complex traits. A GWAS produces estimates of the effect of each variant on a trait measured by a regression coefficient b and its standard error (se). From this, a p value can be calculated for the null hypothesis that the variant has no association with the trait. Using the p values and the number of significant variants, it is common to calculate a false-discovery rate (FDR)1. However, this conventional FDR depends on the calibration of the p value which in turn depends on the data conforming to the statistical model assumed in calculating the p value.

The important implication of FDR is that true discoveries would be confirmed in a powerful, independent experiment. In this paper we directly use the ability to confirm a finding in an independent study to estimate the FDR by using the proportion of sequence variants which have an effect in the same direction in two independent datasets. We call this FDRed for ‘effect direction’ based on a statistical model we call Effect Direction MEta-analysis (EDME) for effect direction meta-analysis. EDME is easily extended to calculate the FDR by effect directions (FDRed) for a multi-trait GWAS. This will allow us to know the number of traits that a significant variant is associated with and precisely which traits these are.

Due to the recent availability of many GWAS results with large sample sizes in humans2,3, meta-analysis of public GWAS summary statistics significantly improved our understanding of pleiotropy4,5,6,7,8. Naturally, traits that are genetically correlated must share part of their causal variants, however, if traits are uncorrelated, can we still detect widespread pleiotropy?

Our study starts with multi-trait GWAS of individual data where the traits are decorrelated by a Cholesky transformation9. After GWAS, we focus on using the information of the consistency of variant effect directions to identify pleiotropic variants associated with uncorrelated traits, instead of distinguishing different types of pleiotropy10. We aim to use: (1) the consistency of variant effects in different populations to evaluate the FDRed independent of their p values; (2) the information on the consistency of variant effects for multiple traits across different populations to quantify pleiotropy at both the variant and trait levels. We propose an EDME of GWAS of 34 complex traits of over 44,000 dairy bulls and cows with over 17.6 million sequence variants. A validation analysis with the 1000-bull genome database confirms the informativeness of pleiotropic variants prioritised by the EDME model.

## Results

### Single-trait GWAS and conventional multi-trait meta-analysis in bull and cow populations

The phenotype of the cows was based on their own record for each trait corrected for fixed effects, while that of the bulls was based on their daughter’s performance. The phenotype correlation matrix among the 34 traits was calculated separately in the bulls and cows and an average correlated matrix was derived. This average correlation matrix was used to calculate a Cholesky factor (L matrix)9 to be applied to the 34 traits. The Cholesky transformation decorrelates the 34 traits by ordering the traits from 1 to 34 and correcting each trait for all the traits before it in the list. We use this method because the traits vary in their completeness, i.e., some traits have more records than others. By ordering the traits from the trait with most complete data to the trait with least complete data, we can make use of all the data because the Cholesky transformation of the Kth trait in the list only needs data on the K − 1 traits and individuals with trait K to have data on all K − 1 traits. Thus the Cholesky transformation produces uncorrelated traits and makes maximal use of incomplete data (Supplementary Data 1, Supplementary Figs. 1 and 2). The average relatedness between the bull and cow populations was around 0 (Supplementary Fig. 3). The transformation produced traits which were almost uncorrelated, but the correlations were not exactly zero in either the cow or bull populations, because the correlation matrices differed slightly between the sexes (Supplementary Fig. 2). The variance of each trait was close to 1 (Supplementary Data 1). For clarity in what follows, when describing results specifically related to Cholesky-transformed traits, a label ‘CT’ was used. For example, a trait called CT temperament was interpreted as the eighth trait in the list corrected for its preceding seven traits. When describing results related to original traits that were not Cholesky-transformed, a label ‘raw’ was used, e.g., raw temperament. The raw protein yield (prot, the first trait) was identical to the CT protein yield.

On average across 34 CT traits, the heritability was 0.42 (±0.035, standard deviation) in bulls and 0.164 (±0.029) in cows (Supplementary Data 1), which was expected as bull phenotypes were more accurate11. On average at p < 1e−6, 1900 (±813) variants were significant per trait in bulls and 2018 (±795) in cows. The FDR calculated by conventional methods (Bolormaa et al.1) at p = 1e−6 was 0.22 (±0.045) for bulls and 0.1(±0.021) for cows across 34 CT traits.

Using the established meta-analysis method of ref. 1, a multi-trait GWAS was calculated within the bulls and cows separately (Supplementary Fig. 4a, b). Overall, many significant variants (associated with at least one CT trait) detected in one sex were significant in the other sex (Supplementary Fig. 4c and Supplementary Data 2). At the p < 1e−6 level, 25,454 variants were significant in bulls, 31,076 in cows and 14,587 in both bulls and cows. A weighted meta-analysis combining variant t value from bulls and cows further increased the number of significant variants to 93,513 (Fig. 1 and Supplementary Data 2). The conventional FDR for all types of meta-analysis was smaller than 0.1% (Supplementary Data 3). A variant clustering analysis supports the existence of pleiotropic QTL distinguished from linkage disequilibrium (LD) between sequence variants (see Methods and Supplementary Fig. 5).

### The EDME model

We introduce the EDME model to analyse GWAS summary statistics for 34 CT traits estimated from the bull and cow populations separately. In the EDME model, the trait-associated variants were categorised as: ‘true negative’ (T) variants that truly decrease the trait value, the ‘true positive’ (T+) variants that truly increase the trait value and the ‘true zero’ (T0) variants with no reliable effects on the trait (the left panel of Fig. 2a). However, the observed effect of each variant on a trait was classified as positive or negative so that when the effect of a variant in both bulls and cows was described it fell into one of the classes ‘++’, ‘−−’, ‘+−’ and ‘−+’ (Fig. 2a and Methods).

We assume that true positives were always observed in the ++ class and true negatives in the −− class and that true zero effect variants have one-fourth probability of being in any of the four classes. This then allowed the calculations

$${\hskip 45pt}T_0 = 2 \times n\left( A \right),$$
(1)
$${\hskip 5pt} T_ - + T_ + = T_r = n\left( B \right) - \frac{{T_0}}{2},$$
(2)
$${\rm{Single}}{\hbox{-}} {\rm{trait}}\;{\rm{FDRed}} = \frac{{T_0}}{{n\left( A \right) + n\left( B \right)}},$$
(3)
$${\rm{Single}}{\hbox{-}}{\rm{trait}}\;{\rm{TDRed}} = \frac{{T_r}}{{n\left( A \right) + n\left( B \right)}}.$$
(4)

Where n(A) was the observed number of variants with inconsistent effect directions (‘+−’ and ‘−+’) and n(B) was the observed number of variants with consistent effect directions (‘++’ and ‘−−’) in GWAS of two different populations. Tr denoted the combined number of T and T+. Then, the FDR and TDR by effect directions (single-trait FDRed and TDRed, respectively) were calculated as Eqs. 3 and 4.

### The single-trait EDME and FDRed

Based on the number of variants with inconsistent effect directions between sexes (n(A)), the FDRed and conventional FDR averaged across 34 CT traits were compared within a range of p value thresholds imposed in both sexes (Fig. 2b). Across 34 CT traits, the number of variants significant in both sexes ranged from 241,657(±6071) at p = 0.1, 798(±424) at p = 1e−6, to 14(±12) at p = 1e−100 (Fig. 2b). Both types of FDR decreased as the p value threshold was made more stringent, with the FDRed being smaller than the conventional FDR. At the p = 1e−3 level, amongst 2850(±1218) significant variants for both sexes, the single-trait FDRed was 0.032(±0.015) while the conventional FDR was 0.98(±0.015). At the p = 1e−9 level, amongst 408(±213) significant variants for both sexes, the single-trait FDRed dropped to 0 and the conventional FDR was 1.6e−05(±3.4e−06) (Fig. 2b). When imposing the p value threshold only in one sex, the FDRed was on average higher than the conventional FDR until a very small p threshold was achieved (Supplementary Data 4). This may be due to differences in LD structure between cows and bulls and/or to the lack of power to detect real effects in both sexes. However, as shown above, if the consideration was restricted to sequence variants that were significant in both sexes, the agreement in the sign of the effect was high and hence the FDRed was low.

### The multi-trait EDME

The single-trait EDME was extended to multiple traits (see Methods) to calculate FDRed and quantify pleiotropy. For each variant, the dot product of the vector of signed t values of both sexes (Π, Eq. 5, detailed in below) was used to quantify the overall agreement of the effect directions on multiple traits per variant. A positive Π indicates effects in the same direction between bulls and cows, whereas a negative Π indicates the opposite direction of effect between bulls and cows.

$${\hskip 35pt}{\mathrm{{\Pi}}}_i = t_{i1} \cdot t_{i2}^T,$$
(5)
$${\rm{Multi}}{\hbox{-}}{\rm{trait}}\;{\rm{FDRed}} = [n\left( {{\rm{variant}}|{\mathrm{{\Pi}}}_i \, \le \, 0} \right) \times 2]/n({\rm{total}}),$$
(6)
$${\rm{Estimated}}\;{\rm{number}}\;{\rm{of}}\;{\rm{TE}} = 2 \,{\times} \mathop {\sum }\limits_1^K n\left( B \right)_{{\rm{MTR}}} \, - \,K,$$
(7)
$${\rm{Number}}\;{\rm{of}}\;{\rm{TC}}_{Tr \times V} = n\left( {{\rm{variant}}} \right) \,{\times}\, \overline {{\rm{TE}}}.$$
(8)

For Eq. 5, ti1 was the t value vector for varianti of population 1 with K elements and ti2 was the t value vector for varianti of population 2 with K elements; K was the total number of traits analysed in each population (K = 34 in this study). For a group of variants, Π values allowed the use of the same logic calculating the single-trait FDRed (Eqs. 1 and 3) to calculate a multi-trait FDRed (Eq. 6). For Eq. 6, n(total) was the total number of variants selected.

As well as this multi-trait estimate of FDR, it is also possible, for each sequence variant, to count the number of traits with the same direction of effect in population 1 (bulls) and population 2 (cows). Allowing for the fact that half the traits were expected to be in the same direction by chance, the EDME logic can estimate the number of traits showing a True Effect (TE) for each variant (Eq. 7). While this estimate was subject to large sampling error for any variant, its average across a large group of variants was consistent (as shown in the results described in the following). For Eq. 7, $$\sum_1^K n\left( B \right)_{{\rm{MTR}}}$$ was a natural number ranging from 0 to K, where for each trait (out of K traits), $$n\left( B \right)_{\rm{MTR}}$$ = 1 if the variant has the same effect direction and $$n\left( B \right)_{\rm{MTR}}$$ = 0 if the variant has different effect directions between the two populations.

The estimated number of TE averaged across a group of variants $$\left( {\overline {\rm{TE}} } \right)$$ × the number of variants in that group (n(variant), Eq. 8) estimated the number of true combinations (TC) of trait by variant effects, $${\rm{TC}}_{Tr \times V}$$. $${\rm{TC}}_{Tr \times V}$$ can be used as a cut-off to prioritise informative trait-variant combinations. Two additional pieces of information can be extracted from the prioritised trait-variant combinations: (i) the observed number of TE for each variant suggesting the extent of pleiotropy at the variant level; (ii) the pleiotropic variants related to each trait, variantstrait, suggesting pleiotropic variants that affect each trait (illustrated in the following results).

### The genome-wide characteristics of multi-trait FDR and pleiotropy

Similar to the single-trait FDRed described above, the multi-trait FDRed remained smaller than the conventional FDR when imposing the multi-trait p value threshold in both sexes (Fig. 2c). The number of variants significant in both sexes ranged from 641,399 at multi-trait p = 0.1, 14,587 at p = 1e−6, to 631 at p = 1e−100 (Fig. 2b). At the multi-trait p = 1e−2 level, amongst 89,223 significant variants for both sexes, the multi-trait FDRed was 0.02 while the conventional FDR = 1. At p = 1e−3, the multi-trait FDRed was 0.003 (conventional FDR = 0.49) and it dropped to 0 at the p = 1e−6 level where the estimated number of TE per variant was 9.63. That is, the average number of traits affected by a single-nucleotide polymorphisms (SNP) in this group is 9.63. Additional analysis with functional data12,13 indicated that both single-trait and multi-trait FDRed can help find informative variants at lenient p value thresholds where conventional FDR was large (Supplementary Note 1). When imposing a significance threshold in only one sex, at the multi-trait p value threshold of 1e−6, the multi-trait FDRed was 0.14 in bulls and 0.2 in cows (Supplementary Data 5).

To further dissect the pleiotropy with a group of informative variants, 93,513 variants with the weighted multi-trait p value (pwm) < 1e−6 from the weighted meta-analysis combining GWAS results of bulls and cows (Fig. 1), were selected. Applying Eq. 8 to this group of variants, the number of TC of trait by variant $$\left( {\rm{TC}}_{Tr \times v} \right)$$ was estimated as 93,513 × 9.63 $$\left( {{\mathrm{estimated}}\;\overline {\rm{TE}} } \right)$$ = 900,820 at this p threshold. This indicated that 900,820 combinations of trait by variant can be interpreted as ‘true’. To specify these combinations, we chose those ranked as top 900,820 by their absolute weighted t values combing sexes across 34 CT traits. These combinations of trait by variant contained 92,537 unique variants with Π > 0 (99% of 93,513 variants with pwm < 1e−6), and their Π values, which indicate the effect direction agreement for multi-trait analysis, were correlated with their conventional FDR based on pwm (Fig. 3a).

For each of the prioritised 92,537 variants, we counted the number of traits with which they were associated among the ‘true’ 900,820 combinations of trait by variant and this was defined as the observed number of TE. The observed number of TE for each variant ranged from 1 to 22 with a mean of 9.7 (Fig. 3b and Table 1). Also, these prioritised pleiotropic variants tagged up to 3550 (11.6% of 30,514) and 3958 (13% of 30,440) clusters representing different QTL in bulls and cows, respectively. Based on the prioritised pleiotropic variants that were also located within these clusters, each cluster had 9.79 and 9.71 (ranging from 1 to 22) observed $$\overline {\rm{TE}}$$ in bulls and cows, respectively. These results were consistent with the average number of $$\overline {\rm{TE}}$$ per variant (9.63) estimated previously (Eq. 7, Fig. 4b). The pleiotropic characteristics of these variants were detailed in Table 1. The pwm and conventional FDR decreased with the increase of the Π value which was indicative of the consistency of effect direction between two populations. A list of the top 500 variants based on Π value within each category of pleiotropy was shown in Supplementary Data 6.

The most pleiotropic variants accounting for 0.00003% of the genome with observed TE on 22 CT traits were close to the GC region (Chr6:88.8M) (Supplementary Data 6). Variants with TE on 22 CT traits also tagged the gene ZNF613 (Chr18:58.1M) which was close to the CTU1 region (Chr18:57.5M). Variants from the CTU1 region had observed TE on 20 CT traits. Both GC and CTU1 loci were reported as lead dairy cattle pleiotropic loci in the 2017 pleiotropy study9. Other commonly known major loci for dairy cattle (see refs. 9,11,14,15,16) were also captured, including DGAT1 (Chr14:1.8M) that had TE on up to 17 CT traits, MGST1 (Chr5:93.9M) that had TE on 13 CT traits, CSN1S1 (Chr6:87.1M) that had TE on 12 CT traits, GHR (Chr20:32M) that had TE on 15 CT traits and ENSBTAG00000048091 (beta-lactoglobulin-2-like), a potentially duplicated gene of the adjacent PAEP (beta-lactoglobulin, Chr11:103.2M) that had TE on 11 CT traits.

Less known pleiotropic loci were also identified. A highly pleiotropic loci PC (Chr29:45.6M), coding for the enzyme pyruvate carboxylase which is needed for gluconeogenesis and lipogenesis, had TE on 18 CT traits. A pleiotropic locus with TE on 12 CT traits was SLC22A6 (Chr29:41.9M) which had reported roles in regulating water metabolism in mammals17. In addition, variants from the X chromosome (chromosome 30) with pleiotropic effects were also identified. Tagged genes included CD40LG (Chr30:20.2M) and ARHGEF6 (Chr30:20.3M), although these variants only affected 2–3 CT traits (Supplementary Data 6).

Further analysis of those informative pleiotropic variants showed that variants affecting a larger number of CT traits had overall larger effect sizes across 34 CT traits, compared to those variants that affected a smaller number of CT traits (Fig. 3c). Many variants that showed widespread pleiotropic effects, i.e., TE > 8, had large effect size variation across 34 CT traits (e.g., the outliers in boxplots of Fig. 3d). This meant that these pleiotropic variants could have large effects on some traits but small effects on other traits. Some noticeable variants/loci associated with multiple CT trait groups were detailed in Supplementary Note 2, Supplementary Figure 6, Supplementary Datas 7 and 8.

### Characteristics of trait-related variants

As described above, the 900,820 combinations of trait by variant assigned pleiotropic variants to each CT trait (variantstrait). Each set of variantstrait was tested for whether the variantstrait had specific associations with the trait that they were assigned to. For each trait, the single-trait FDRed was calculated using the trait-related variantstrait. Such single-trait FDRed (average < 0.1) was much smaller than the single-trait FDRed calculated using those general pleiotropic variants associated with any one of the 34 traits (multi-trait pwm < 1e−6, average single-trait FDRed > 0.7) (Supplementary Fig. 7a). Also, previously identified variants regulating cattle stature18 had the strongest enrichment (adjusted p = 1.9e–08) in the variantstrait related to dairy cattle CT stature (trait 19) (Supplementary Fig. 7b). These results support that the variantstrait had specific associations with the traits they were related to. However, we note that these variants were related to CT traits and the interpretation of them is different from the raw traits.

### The sharing of pleiotropic variants between traits

Among the 900,820 combinations of trait by variant, we counted the number of variants that affected each pair of CT traits (Fig. 4a). Many pairs of CT traits have higher counts than expected by chance. While many pleiotropic variantstrait were detected for CT milk production traits (the first four traits) and they had significant sharing of pleiotropic variantstrait amongst themselves, CT milk production traits did not have significant sharing of pleiotropic variantstrait with every other CT trait. Although traits like CT body condition score (BCS) had the least amount of pleiotropic variantstrait, CT BCS had significant sharing of pleiotropic variantstrait with a wide range of other CT traits.

For each CT trait, the pleiotropic variantstrait sharing index, i.e., the number of pleiotropic variantstrait overlapping divided by the number in the union for each trait pair, averaged across the other 33 traits was calculated. As shown in Fig. 4b, the average pleiotropic variantstrait sharing index was used to rank CT traits from ‘influential’ (sharing pleiotropic variantstrait with many traits) to ‘independent’ (sharing pleiotropic variantstrait with few traits). The first trait protein yield was ranked as the most influential trait, followed by other CT milk production traits. The CT type trait Bone (trait 21), the original trait of which indicated the flatness of legbone of the cattle but is negatively correlated with fatness, ranked the fifth for its influence. The last CT trait BCS ranked relatively high for its influence, given its small number of pleiotropic variantstrait. The CT temperament (trait 8) and the overall type (OType, trait 25) were ranked as relatively independent amongst traits analysed (Fig. 4b).

### Raw trait relationships informed by Cholesky decorrelation

The aforementioned results showed that the CT milk production traits had limited sharing of pleiotropic variantstrait with other CT traits (Fig. 4a). The Cholesky transformation corrected each target raw trait for all preceding raw traits. For instance, the CT survival (trait 7) was corrected for raw traits 1–6 which included the trait protein yield (trait 1). The correction used the overall regression coefficient. Therefore, a sequence variant that affected protein yield and raw survival by the amount predicted by the regression of raw survival on protein predicts, had no effect on the CT survival (trait 7). If all variants that affected raw protein yield had the predicted effects on raw survival, then none of them would have an effect on CT survival. This is indeed almost what we observe (Fig. 5a, b). The simplest interpretation of this pattern of effects is that these variants only affected raw survival because of their effects on protein yield. In other words, protein yield tends to cause raw survival, or farmers cull cows with low protein yield. Figure 5a, b highlights the effects of variants from key genes associated with the raw trait of protein yield and survival.

To verify this causal inference, bidirectional Mendelian Randomisations19 (MR) using variants as instrumental variables implemented in GSMR20 were conducted for 50 raw trait pairs that had the least amount of variant sharing after Cholesky transformation (Supplementary Data 9). Defined as the Bonferroni-correction adjusted pMR < 0.05 for forward MR and adjusted pMR ≥ 0.05 for reverse MR, 8 putative causal relationships were identified (Supplementary Data 9), including protein $$\mathop { \longrightarrow }\limits^{\rm{cause}}$$ survival (Fig. 5c, d). Additional plots of the MR analysis indicated the existence of directional pleiotropy20,21 in the MR of survival protein, but not in the MR of protein $$\mathop { \longrightarrow }\limits^{\rm{cause}}$$ survival (Supplementary Fig. 8). Given that there were a total of 218 pairs with adjusted pMR < 0.05 for forward MR and adjusted pMR ≥ 0.05 for reverse MR and that a total of 2664 MR tested, the enrichment of the 8 putative causal pairs out of 50 was small but different from chance (poverlap = 0.043 based on hypergeometric test). Raw somatic cell count (SCC) also had some putative causal effects on raw mammary related traits (Supplementary Data 9).

### Pleiotropic variants

To the best of our knowledge, we defined the novel pleiotropic variants as those which were outside the ±2 Mb region of the 1181 pleiotropic SNPs with pwm < 1e−6 in the 2017 study9 that used high-density SNP array genotypes to analyse 25 dairy cattle traits. These previously detected 1181 pleiotropic SNPs tag 65,420 sequence variants out of the 92,537 informative pleiotropic variants identified in the current study. According to the annotation of Ensembl VEP22, the remaining 27,117 pleiotropic variants were related to 1347 Ensemble genes and around 10% of these variants were not intergenic or intronic variants. An LD pruned set (LD r2 < 0.1 in 2 Mb sliding windows) of these variants were detailed in Supplementary Data 10. These pleiotropic QTL included variants close (<0.5 Mb) to IGFBP7 (Chr6:74.5 M)23, DNMT3B (Chr13:62.6M)24, SLC24A4 (Chr21:57.6M)25, SLC37A126 and FTO (Chr18:22.1M)27 which were reported as single-trait or production trait QTL previously. However, there were also many loci not previously reported in cattle, such as intergenic variants tagging MCF2 (ChrX:23M), CTNNA2 (Chr11:54.9M), NSF (Chr19:46M) and KRT14 (Chr19: 42.4M).

Two examples of the less reported pleiotropic loci in cattle are given in Fig. 6a, b and the variant effects on 37 original traits were detailed. On cattle chromosome 18 (22.5M), the top variant within the intronic region of FTO gene had pleiotropic effects on raw fat percentage, milk yield and many raw linear assessment traits, including likeability (Like), rear teat placement (TeatPR) and teat length (TeatL) (Fig. 6a). Variants with stronger pleiotropic effects than the FTO variants were found from the intergenic region between the adjacent IRX3 and IRX5 and between AKTIP and RPGRIP1. Variant clustering (see Methods) already indicated the existence of multiple QTL in this region (Fig. 6a). To further verify if these three variants mark more than one causal variant, we fitted all three simultaneously. Although close to FTO, all three variants were significant when fitted together and all three had different patterns of effects across raw traits (Fig. 6a). This observation appeared to mirror the bystander gene case of the human FTO where the regulatory effects of FTO on human body index and obesity risks are dependent on its neighbour genes28,29,30. All three loci had effects on raw temperament (Temp).

On chromosome 19, many pleiotropic variants were identified from the 42.2 to 42.4M region (Fig. 6b), tagging a cluster of keratin genes, usually making up the cornified surface of rumen and skin31, including KRT9, KRT14, KRT15, KRT16, KRT19, KRT35, and KRT36. Again, cluster analysis indicated the existence of several QTL in this region. The joint analysis showed that selected top variants had a correlated pattern of effects on milk production traits. These same variants had a less correlated pattern of effects on raw traits of SCC, temperament and many raw linear type traits of the mammary gland (Fig. 6b), although these effects were relatively small.

### Validating the pleiotropic variants prioritised by EDME and p value

Three sets of pleiotropic variants were selected for validation analysis with 37 raw traits (see Methods and Supplementary Data 11): (1) the 92,537 informative pleiotropic variants prioritised by the EDME model after pwm < 1e−6 selection, (2) 93,513 variants prioritised by pwm < 1e−6 alone and (3) a random set of 93,513 variants. Overall, both the EDME with pwm and pwm alone prioritised pleiotropic variants predicted that dairy cattle were genetically better than the beef cattle for dairy traits and the difference in predictions between beef and dairy was generally lowest in the Random set (Fig. 7). Note that for raw SCC, fertility (Fert), temperament (Temp), milking speed (MSpeed) and likeability (Like), larger gEBV indicates poorer dairy trait performance. Based on our knowledge, larger gEBV in dairy than beef cattle is expected for 13 raw traits (Supplementary Data 12). For these 13 traits, variants prioritised by EDME combined with p value showed the most frequent (9/13 in bulls and 7/13 times in cows) prediction of the largest advantage in gEBV of dairy over beef cattle (Fig. 7) compared to other sets.

## Discussion

In this paper, we focus on finding pleiotropic variants and identifying the traits that each variant affects using EDME. Our approach differing from other published studies in three respects. Firstly, we use individual animal data rather than GWAS summaries. This is partly out of necessity because few summary data are available for cattle and most non-human species, but also, because some information is lost in summaries, it is better to start with individual data when this is available. Secondly, we use uncorrelated traits by applying a Cholesky transformation to the 34 raw traits. The choice of Cholesky transformation maximises the amount of data used by ordering the traits from the trait with the most complete data to the trait with the least complete data. Each CT trait is the raw trait corrected for all traits before it in this order. Transformed traits are always difficult to interpret but some interpretation of the CT traits is still possible. For instance, survival corrected for milk production traits focusses on reasons for death or culling other than milk production. However, when the number of traits considered increases, CT traits ordered late in the list would have reduced interpretability, but this does not affect the estimation of the multi-trait p value, FDRed and number of traits associated with each variant. Where GWAS is already conducted for many potentially correlated traits, it is also possible to decorrelate the variant-trait association matrix as reported by Jordan et al.7. However, as the Jordan et al. stated, traits decorrelated using their approach do not necessarily correspond to any specific biological traits of interest7. While we do not aim to distinguish between horizontal and vertical pleiotropy, CT traits have a natural interpretation for vertical pleiotropy or Mendelian randomisation where the causal trait precedes targeted traits. However, we have limited power to distinguish cause and effect. Thirdly, we use two separate populations (bulls and cows) to estimate the FDR and the number of true discoveries. This method does not depend on the exact calculation of p values which depends on a statistical model that may not describe the data exactly.

The metric FDRed provides an alternative and powerful approach to controlling GWAS false discoveries. The FDRed estimated from the sign agreement between results from bulls and cows is smaller than the conventional FDR when variants are chosen that pass a significance test in both sexes. It is possible that variants with small effects but enriched in functional variants can be prioritised by the FDRed (Supplementary Note 1). When variants are selected based on one population only, FDRed is higher than the conventional FDR. This is probably explained by slight differences between the bulls and cows. The bulls come from a slightly different population (e.g., older) and therefore the pattern of LD was different between cows and bulls. Also, the phenotypic values in cows contained larger error variances than bulls11, therefore, the power for detecting lead variants across 34 traits was different between these two populations, as shown in heritability estimates (Supplementary Data 1), and the correlations between traits were slightly different (Supplementary Fig. 2). These findings suggest that when validating variants using GWAS summary statistics from different populations, the definition of the trait and LD within the populations should be as similar as possible. The genetic correlation between a trait measured in males and females is not always 1.0 and this could create another difference between GWAS results from the two sexes. However, this did not apply in our case because the ‘phenotype’ of the bulls came from the performance of their daughters. These insights are important for the FDR control and validation of GWAS.

EDME quantifies the number of traits associated with variants in general and with particular variants based on the agreement of variant effect-directions for multiple traits across two populations. This unique analysis provides additional quantities for pleiotropy which fills a gap in our previous methods where only the multi-trait p value was estimated to describe the overall magnitude of variant effects on multiple uncorrelated traits9. As mentioned above, Jordan et al.7 decorrelated a variant-trait association matrix to study horizontal pleiotropy where they quantified the magnitude and number of associated traits for all variants that entered the study. Our analysis focuses on identifying pleiotropic variants for uncorrelated traits in general and uses the multi-trait p value to prioritise a set of credible pleiotropic variants that were analysed by EDME to eliminate false discoveries (Π > 0) and dissect pleiotropy. The two types of analysis serve different purposes and data availability in two species.

Among variants with p < 1e−6 in both sexes we estimate that the average number of traits affected per variant (TE) was 9.6 and this supports the conclusion that pleiotropic effects on uncorrelated traits are widespread. This conclusion is also reached by Jordan et al.7 using a different analysis in humans. Widespread pleiotropy was also found by analyses of GWAS of correlated traits in humans6,8. However, EDME showed that variants varied widely in the number of traits affected from 1 to 22. In general, variants with a large effect on one trait tended to have small effects on many traits (Fig. 4c, d). In additional, EDME assigned a set of pleiotropic variants to each CT trait (trait-related variants, variantstrait) to inform trait biology which is supported by published data18 (Supplementary Fig. 7).

We discovered that the lack of sharing of variants between CT traits can to some extent inform causal relationships between raw traits, as supported by MR analysis (Fig. 5 and Supplementary Fig. 8 and Supplementary Data 9). However, a more systematic analysis of raw and CT trait relationships aiming at understanding biology between specific trait groups with larger sample size is required in the future.

A key assumption for conducting EDME is that the GWAS results are obtained for the same traits in two independent populations. It is relatively easy to meet this assumption in dairy cattle since matched phenotypes in independent bull and cow populations are common. It is probably also relatively easy to meet the assumption in humans and mice since large GWAS resources are available2,3,32. However, it is not clear how easy this assumption can be met in species like beef cattle, chicken, fish, flies and plants. Future research applying EDME to these species will help better understand the usage of the information of variant effect-direction agreement from GWAS between populations in eliminating false discoveries.

Our study identifies pleiotropic loci that provide biological insights beyond cattle, including the AKTIP-FTO-IRX3-IRX5 loci where the pleiotropic effects on dairy cattle traits were stronger at the non-FTO regions (Fig. 6a). Another noticeable loci is the region on cattle chromosome 19 (42.2–42.4M) enriched with keratin genes (KRT35-14-16, Fig. 6b). Since KRT genes are highly expressed in epithelial systems31 and it is mammary epithelial cells that secrete milk, mutations in KRT genes might be associated with mammary traits and SCC.

In conclusion, we introduce a set of methods for the EDME using GWAS summary statistics to quantify FDR independent of p values. The multi-trait EDME quantifies the pleiotropy where extreme, prevalent and unevenly distributed pleiotropic effect patterns are comprehensively described. EMDE reliably identifies trait-related variants which help to inform the biology behind complex traits. Identified pleiotropic loci including FTO and KRT update our fundamental knowledge of pleiotropy and gene regulation in large mammalian species. Prioritised pleiotropic variants are supported by our validation with the 1000-bull Genome data.

## Materials and methods

### Animals, phenotype and genotype data

No live animals were used in this study. Phenotype data was based on trait deviations for cows and daughter trait deviations for bulls (Supplementary Data 1). Daughter trait deviations were the average trait deviations of a bull’s daughters and all phenotypes were pre-corrected for known fixed effects, with processing done by DataGene, Australia for the official release of National bull breeding values (http://www.datagene.com.au/). Only those bulls’ phenotypes which were based on records from more than 15 progeny were included. Phenotype data of up to 11,923 bulls included 6569 CRV bulls (https://www.crv4all-international.com/) with phenotypes derived from their Interbull MACE breeding values (https://interbull.org/ib/interbullactivities) degressed on the Australian scale and converted to the scale of the daughter trait deviation. The remaining 5354 bulls and all 32,347 cows were from DataGene. All these animals included Holstein (9739 ♂ / 22,899 ♀), Jersey (2059 ♂ / 6174 ♀), cross breed (0 ♂ / 2850 ♀) and Australian Red dairy breeds (125 ♂ / 424 ♀).

The genotypes for the above-described bulls and cows included a total of 17,669,372 imputed sequence variants with Minimac333,34 imputation accuracy R2 > 0.4. Genotypes were imputed using Run6 version of the 1000-bull genome data18,35 as the reference set, using Eagle36 to first phase the data and then Minimac3 for imputation. Sequence variants with minor allele frequency > 0.001 in 11,923 bulls and in 32,347 cows were used in the GWAS.

Phenotype decorrelation was conducted using Cholesky transformation, with the formula of $$C_n = L^{ - 1}g_n$$ (Eq. 9, published in ref. 9). Cn was a K (number of traits)×1 vector of Cholesky scores for animal n; L was the K × K lower triangular matrix of the Cholesky factor which satisfied LLt = COV, the K × K covariance matrix of raw scores after standardisation as z-scores, gn was a K × 1 vector of traits for animal n. Cholesky transformation decorrelated all traits at once. However, as a result, the Kth Cholesky-transformed trait can be interpreted as the Kth original trait corrected for the preceding K − 1 traits and each Cholesky-transformed trait had a variance of close to 1 (Supplementary Data 1). The reason for the choice of these 34 traits and the trait order to conduct Cholesky transformation was to fully utilise the dataset containing the varying number of phenotypic records. A free consideration of the trait order would require all traits to have complete records (e.g., ref. 9) and this would result in sample size for all traits being determined by the trait with the smallest number of records, i.e., 1439 bulls and 4086 cows. Another condition required to be met for the current Cholesky transformation was that the same descending order of the selected traits matches between the bull and the cow data sets. This allowed that CT traits were the same in bulls and cows such that the GWAS of one trait in two sexes were comparable.

### Single-trait GWAS

The 34 decorrelated traits were analysed one trait at a time independently in each sex with linear mixed models using GCTA37

$$y = {\rm{mean + breed}} + bx + a + {\rm{error}},$$
(10)

where y = vector of phenotypes for bulls or cows, breed = three breeds for bulls, Holstein, Jersey and Australian Red and four breeds for cows (Holstein, Jersey, Australian Red and MIX); bx = regression coefficient b on variant genotypes x; a = polygenic random effects N(0, g2) where G = genomic relatedness matrix based on all variants. The above GWAS model was also applied to estimate the total genomic variance for heritability calculations for all traits, but without including the “bx” term in the model.

### The conventional multi-trait meta-analysis of single-trait GWAS

The multi-trait χ2 statistic for the ith variant was calculated based on its signed t values generated from each single-trait GWAS: $$\chi ^2 = t\prime _iV^{ - 1}t_i$$ (Eq. 11, published in ref. 1). ti was a K (number of traits = 34) × 1 vector of the signed t values of varianti effects, i.e., b/se, for the K traits; ti′ was a transpose of vector ti(1 × K); and V−1 was an inverse of the K × K correlation matrix where the correlation was calculated over the all estimated variant effects (signed t values) of all trait pairs. The χ2 value of each variant was examined for significance based on a χ2 distribution with k degrees of freedom to test against the null hypothesis that the variant had no significant effects on any one of the Kth traits. The conventional FDR of χ2 tests were calculated following1.

To fully utilise the GWAS summary data obtained in bulls and cows separately, a weighted multi-trait meta-analysis of the signed t values of bull GWAS and cow GWAS results was performed using the formula

$$t_w = \frac{{b_w}}{{{\rm{se}}_w}} = \frac{{\frac{{\frac{{b_{{{\rm{bull}}}}}}{{{\rm{se}}_{{{\rm{bull}}}}^2}}\, +\, \frac{{b_{{\rm{cow}}}}}{{{\rm{se}}_{{\rm{cow}}}^2}}}}{{\frac{1}{{{\rm{se}}_{{{\rm{bull}}}}^2}}\, +\, \frac{1}{{{\rm{se}}_{{\rm{cow}}}^2}}}}}}{{\sqrt {\frac{1}{{\frac{1}{{{\rm{se}}_{{{\rm{bull}}}}^2}}\, +\, \frac{1}{{{\rm{se}}_{{\rm{cow}}}^2}}}}} }},$$
(11)

(published in ref. 12). Where tw was the weighted t value for bulls and cows accounting for the phenotypic error differences between bulls and cows11, calculated as the quotient of the weighted variant effects Bw and the weighted effect error sew; bbull and sebull were the variant effects and errors obtained from single-trait GWAS in bulls and bcow and secow were the variant effects and errors from cow GWAS. Once the tw of each trait was obtained, the multi-trait meta-analysis was performed using Eq. 11. This weighted meta-analysis approach was also used to analyse the GWAS summary statistics data of the 2017 study of cattle pleiotropy9, where over 630,003 SNPs from the high-density SNP chip panel were used for GWAS with 25 traits in bulls and cows.

### The sliding-window variant clustering

The use of sequence variants led to a large number of variants in high LD. Consequently, a single QTL might be tagged by many sequence variants. To give an indication of the differentiation of pleiotropic QTL from sequence variants, a sliding-window (5 Mb) variant clustering was conducted. All calculations regarding this clustering were done separately in each sex. Firstly, variants were ranked by their multi-trait p values adjusted (divided) by their functional-and-evolutionary trait heritability (FAETH) score13. Then, variants ranked within the top 10% within each sex were selected for LD pruning, using Plink 1.938 with r2 > 0.95 within 5 Mb windows. For pruned variants, in each 5 Mb window, with the overlapping size of 2.5 Mb, we calculated $$\rho _{ij} = r_{\rm{cor}}( {t_{{\rm{variant}}_i},t_{{\rm{variant}}_j}} ) \times r_{\rm{LD}}( {{\rm{variant}}_i,{\rm{variant}}_j} ),$$ where $$r_{\rm{cor}}( {t_{{\rm{variant}}_i},t_{{\rm{variant}}_j}} )$$ was the correlation between two vectors of t values (beta/s.e. from GWAS described above) of 34 CT traits for varianti and variantj; $$r_{\rm{LD}}\left( {{\rm{variant}}_i,{\rm{variant}}_j} \right)$$ was the LD (r) measured as the correlation between the genotypes of varianti and variantj. These ρ values were clustered using graph based Random Walks determining densely connected subgraphs (clusters)39 implemented in igraph40. On average there were 12 variants per cluster in bulls and cows. The clustering of the variants aimed at finding groups of variants tagging the same QTL. Normally, variants with strong LD were considered as tagging the same QTL. However, variants tagging one QTL should also show similar effect patterns across multiple traits (correlated GWAS t values). Therefore, by adding the correlation of effect t value, i.e., $$r_{\rm{cor}}( {t_{{\rm{variant}}_i},t_{{\rm{variant}}_j}})$$ on top of the $$r_{\rm{LD}}( {{\rm{variant}}_i,{\rm{variant}}_j} )$$, we expect the metric ρij to better group variants tagging the same QTL than using rcor or rLD alone. After clustering, up to 50 clusters were retrieved within each window. To verify if identified clusters can represent local QTL, the top 100 variants from clusters based on their FAETH adjusted multi-trait p value were selected. These 100 top variants were used to conduct a conditional analysis of GWAS37 of 34 traits in each sex. A meta-analysis of the conditional GWAS results used Eq. 11.

Meta-analysis of the conditional GWAS results showed that pleiotropic effects dramatically dropped, but only for those variants from the clusters with which the top 100 variants were selected (Supplementary Fig. 5). However, the pleiotropic effects of those variants from the clusters not used to select the top 100 variants did not show significant drops (Supplementary Fig. 5). These results support the hypothesis that, in general, pleiotropic QTL can be differentiated from sequence variants, but no doubt there were exceptions to this conclusion.

### The single-trait EDME model

The R code of EDME can be downloaded from https://figshare.com/s/42017bcae071d24639f4 and an online tutorial for conducting EDME of GWAS is available at: https://ruidongxiang.com/effect-direction-meta-analysis-edme-of-gwas/. In theory, there should be three categories of variants associated with a given trait: (1) variants with ‘true negative’ effects on the trait, i.e., T variants that truly decrease the trait value, (2) variants with ‘true positive’ effects on the trait, i.e., T+ variants that truly increase the trait value and (3) variants with ‘true zero’ effects on the trait, i.e., the T0 variants that had no effects on the trait. In practice, a GWAS conducted for the same trait in two different populations would have four categories of variant effect directions: (1) the ‘−−’ where the effect was consistently negative for the trait in two populations, e.g., bulls and cows in this study; (2) the ‘++’ where the effect was consistently positive for the trait in bulls and cows; (3) the ‘+−’ where the effect was positive for the trait in bulls but negative in cows; and (4) the ‘−+’ where the effect was negative for the trait in bulls but positive in cows.

If there were T0 variants which should have no TE, they were expected to be randomly distributed across the four observed categories (‘++’, ‘−−’, ‘+−’ and ‘−+’) of variants. This meant that each one of the four categories of the ‘++’, ‘−−’, ‘+−’ and ‘−+’ variants contain an equal amount of the T0 variants (i.e., one-fourth). Therefore, the total number of ‘+−’ and ‘−+’ variants observed was equal to one-half of the total number of T0 variants expected. This then allowed the deduction of model Eq. 1, as described in the section ‘Results’, to calculate the number of T0. Once the number of T0 was known, the total number of T and T+ variants can be determined by equating the expectation with the observed number of ‘++’ and ‘−−’ variants (n(B)) as described in the first and the third quadrant of the right panel of Fig. 2a

$$n\left( B \right) = T_ + + \frac{{T_0}}{4} + T_ - + \frac{{T_0}}{4} = T_ + + T_ - + \frac{{T_0}}{2},$$

then

$$n\left( B \right) - \frac{{T_0}}{2} = T_ - + T_ + = T_r.$$
(2)

Once the number of T0 and Tr were determined, the false discovery rate by effect direction (FDRed) and true discovery rate by effect direction (TDRed) can be calculated based on the total number of variants entered the analysis in the two populations (n(A) + n(B)) as described in Eqs. 3 and 4 (see Results section).

### The EDME model for multi-trait analysis

For equation $${\mathrm{{\Pi}}}_i = t_{i1} \cdot t_{i2}^T$$ (5), the Π value was calculated as the dot product of the vectors of t values of bulls and cows with the same length of K, where K is the total number of traits. The idea was that the effect direction across multiple traits, i.e., the sign of the Π value, was not driven by the small trait effects which were more likely to arise from noise. Instead, the sign of the Π value was driven by relatively large trait effects which were likely to be robust. Then, the number of variants with negative Π values were used to calculate the multi-trait FDRed in Eq. 6.

For one variant analysed in K traits across the two populations, the single-trait Eq. 2 can be extended as

$$n\left( B \right)_{\rm{MTR}} = \left( {P_{T_r} + \frac{{P_{T_0}}}{2}} \right) \times K,$$
(13)

where $$n\left( B \right)_{\rm{MTR}}$$ was the multi-trait version of n(B) that was equivalent to $$\mathop {\sum }\limits_1^K n\left( B \right)_{\rm{MTR}}$$ (in Eq. 7), where $$n\left( B \right)_{\rm{MTR}}$$ = 1 if the variant has the same effect direction and $$n\left( B \right)_{\rm{MTR}}$$ = 0 if the variant has different effect directions between the two populations. $$P_{T_0}$$ was the proportion of the total number of variants with zero effects (T0) and $$P_{T_r}$$ was the proportion of the total number of variants with TE (Tr). As $$P_{T_r}$$ + $$P_{T_0} = 1$$; Eq. 13 was solvable and can be re-arranged as

$$P_{T_r} = \frac{{2 \times n\left( B \right)_{\rm{MTR}}}}{K} - 1 = \frac{{2 \times \mathop {\sum }\limits_1^K n\left( B \right)_{\rm{MTR}}}}{K} - 1 = {\rm{multi}}{\hbox{-}}{\rm{trait}}\;{\rm{TDRed}},$$

then

$$\begin{array}{l}{\rm{multi}}{\hbox{-}}{\rm{trait}}\;{\rm{TDRed}} \times K = 2 \times \mathop {\sum }\limits_1^K n\left( B \right)_{\rm{MTR}}\, -\, K = {\rm{estimated}}\;{\rm{number}}\;{\rm{of}}\;{\rm{TE}} =Eq. \, 7\end{array},$$

where TE is the True Effect of a variant on traits. Note that if the number of traits considered was one (K = 1), the multi-trait Eq. 13 was then identical to single-trait Eq. 2 at the probability level. TDRedMTR was the multi-trait version of TDRed for Eq. 4.

For Eq. 8, when many variants were selected, the estimated number of TE averaged across a group of variants, i.e., $$\overline {\rm{TE}}$$, became stable. Thus, the number of variants with positive Π values n(variant) multiplied by the number of $$\overline {\rm{TE}}$$ will inform the TC of trait by variant, $${\rm{TC}}_{Tr \times V}$$. For single-trait GWAS of K traits, each one of the total X number (e.g., 17 mil) of variants would have K number (e.g., 34) of t values. Then, for a total X number of variants by K number of traits, a X × K length (e.g., 17 mil × 34) vector of t values can be obtained. If $${\rm{TC}}_{Tr \times V} = n\left( {{\rm{variant}}} \right) \times \overline {\rm{TE}}$$ = 80,000 × 10 = 800,000, then the top 800,000 combinations of trait by variant ranked by their t value (absolute value) in the X × K vector would be prioritised by Eq. 8. The unique variants with Π > 0 in these combinations of trait by variant were the prioritised pleiotropic variants.

Once a prioritised set of X × K TC of trait by variant was achieved, two additional pieces of information can be estimated. One was the count of the traits assigned to each variant in the X × K vector and this was the observed TE for each variant. The other was the count of the variants with positive Π assigned to each trait in the X × K vector and this was the identification of the pleiotropic variants related to each trait, variantstrait. See the online tutorial for using EDME to quantify pleiotropy and to do other related analyses at https://ruidongxiang.com/effect-direction-meta-analysis-edme-of-gwas/.

### Sharing of trait-related variants (variantstrait)

The number of pleiotropic variantstrait in overlaps between 34 CT trait sets were counted. The significance of overlaps were compared with the expected number using the hypergeometric test (p) using the variant-overlap matrices generated by GeneOverlap41 in R. This analysis required four types of counts: the size of overlap between set A (e.g., variants assigned to trait 1) and set B (e.g., variants assigned to trait 2), the size of set A, the size of set B and the size of background (92,537 prioritised by EDME). The number of variants in the overlap and in the union of pleiotropic variantstrait between traits was obtained. The number of overlapping dividing the number of the union was used to calculate the sharing index for each trait pair. For each trait, its sharing indices with the other 33 traits (excluding self-pair) were averaged and the averaged sharing index was then used to rank each trait.

### Conventional hierarchical cluster analysis

This clustering was based on the correlation matrices for weighted t values (Eq. 11) of variants between bulls and cows. The correlation matrices were used to perform hierarchical clustering and were shown as heatmaps using ‘complexHeatmap’42 in R v3.4.1.

### Variant annotation

Variants were annotated using the Variant Effect Prediction of Ensembl22 and NGS-SNP43 based on the Run 6 version of the 1000-bull genome data.

### Mendelian randomisation

This analysis was conducted using GSMR20. Summary statistics of GWAS of raw traits (daughter trait deviations, not Cholesky transformed) for those 93,513 variants prioritised by p value and multi-trait EDME were used as variant input for GSMR. Settings of MR followed the default options of GSMR, including the threshold of HEIDI outlier test being p < 0.01. Both forward and reverse analysis were conducted. GSMR generated MR p values were corrected for multi-testing by the Bonferroni method. A putative causal relationship was defined as the forward MR was significant and the reverse MR was not significant at Bonferroni corrected p < 0.05 level.

### Further analysis of variantstrait

For each set of variantstrait identified by the multi-trait analysis, their single-trait GWAS from bulls and cows were mapped and were analysed for single-trait FDRed using Eq. 3. Totally, 164 previously prioritised variants for cattle stature were retrieved from ref. 18 and their enrichment in each set of variantstrait was tested using the hypergeometric test. The p value of enrichment was adjusted for multi-testing.

### Joint analysis

The joint analysis was conducted using GCTA-COJO37 function. The model for the joint analysis was the same as Eq. 10 expect that selected variants were fitted together. The joint analysis was carried out for each one of the 37 raw traits in bulls and cows. The weighted t value was re-calculated based on the beta (‘bJ’) and se (‘seJ’) generated from the joint analysis for each trait using Eq. 11. Then the weighted t values were presented in Fig. 6a, b.

### Validation in the 1000-bull genome data

The 1000-bull genome database (Run 618,35) contained the whole genome sequence of cattle independent of the Australian data with no phenotype but with breed information. To perform a validation test without phenotype in the validation population, we used the genotypes of selected variants and 37 raw traits (Supplementary Data 11) of the Australian dairy cattle to generate genomic prediction equations separately in the bull and cow populations. Then we used these prediction equations to predict the phenotype of each 1000-bull individual (i.e., generating two sets of genomic estimated breeding values, gEBV, for every raw trait). If a variant selection was informative, the average gEBV of the 1000-bull genome dairy cattle (e.g., Holstein breed) was expected to be better than the average gEBV of 1000-bull genome beef cattle (e.g., Angus).

The genomic best linear unbiased prediction (gBLUP) implemented in MTG244 was used to train genomic prediction equations in each of the Australian bull and cow data sets. The genomic prediction equations were generated for three different variant sets: (1) prioritised by EDME and p value, (2) prioritised by p value alone and (3) a random set of variants with the matching variant number. These 3 sets of genomic prediction equations were generated for each of the 37 original traits (raw daughter trait deviations, not Cholesky transformed) in both the bull and cow data sets. The summary statistics of these 37 raw traits were given in Supplementary Data 11. The gBLUP model was the same as Eq. 10 except that no individual variants were fitted (i.e., removing “bx”). This estimated the total genetic value of Australian bulls and cows and allowed back-solving for the variant effect solutions in the Australian data (-sbv option). Then, the variant effect solutions were then used to calculate gEBV of the dairy and beef breed individuals in the 1000-bull genome (Run 6, N = 2334).

The dairy cattle in the 1000-bull genome included Holstein (N = 567), Jersey (N = 66), Brown Swiss (N = 148), Finnish Ayrshire (N = 25), Normandy (N = 44), Norwegian Red (N = 24), Swedish Red (N = 16). The 1000-bull beef cattle included Angus (N = 272), Beef Booster (N = 29), Belgian Blue (N = 16), Blonde d’Aquitaine (N = 26), Braunvieh Beef (N = 4), Charolais (N = 128), Hereford (N = 75), Limousin (N = 82), Maine Anjou (N = 5) and Simmental (N = 225). The mean of gEBV of the 1000-bull dairy and beef cattle were compared using a t test. For the t test of each trait between dairy and beef cattle, a t value was calculated as the mean difference (with reference level set to beef cattle) divided by the standard error and this t value was presented in Fig. 7.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

The DNA sequence data as part of the 1000-bull genome project18,35,45 included PRJNA431934, PRJNA238491, PRJDB2660, PRJEB18113, PRJEB1829, PRJEB27309, PRJEB28191, PRJEB9343, PRJNA210519, PRJNA210521, PRJNA210523, PRJNA279385, PRJNA294709, PRJNA316122, PRJNA474946, PRJNA477833, PRJNA494431, PRJDA48395, PRJNA431934 and PRJNA238491. The variant FAETH score is available at https://doi.org/10.26188/5c5617c01383b. All other data supporting results and figures are shown in the Supplementary Data and Figures.

## Code availability

The GWAS was conducted using public software Plink38 and GCTA37. Variant annotation used public Variant Effect Prediction of Ensembl22 and NGS-SNP43. The code of EDME software implemented in R with sample datasets and demo workflow can be downloaded from https://figshare.com/s/42017bcae071d24639f4. An online tutorial showing how to use the code and sample data to conduct EDME of GWAS is available at https://ruidongxiang.com/effect-direction-meta-analysis-edme-of-gwas/.

## References

1. 1.

Bolormaa, S. et al. A multi-trait, meta-analysis for detecting pleiotropic polymorphisms for stature, fatness and reproduction in beef cattle. PLoS Genet. 10, e1004198 (2014).

2. 2.

Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2018).

3. 3.

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203 (2018).

4. 4.

Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50, 229 (2018).

5. 5.

Pickrell, J. K. et al. Detection and interpretation of shared genetic influences on 42 human traits. Nat. Genet. 48, 709 (2016).

6. 6.

Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).

7. 7.

Jordan, D. M., Verbanck, M. & Do, R. HOPS: a quantitative score reveals pervasive horizontal pleiotropy in human genetic variation is driven by extreme polygenicity of human traits and diseases. Genome Biol. 20, 222 (2019).

8. 8.

Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693 (2018).

9. 9.

Xiang, R., MacLeod, I. M., Bolormaa, S. & Goddard, M. E. Genome-wide comparative analyses of correlated and uncorrelated phenotypes identify major pleiotropic variants in dairy cattle. Sci. Rep. 7, 9248 (2017).

10. 10.

Hemani, G., Bowden, J. & Davey Smith, G. Evaluating the potential role of pleiotropy in Mendelian randomization studies. Hum. Mol. Genet. 27, R195–R208 (2018).

11. 11.

Kemper, K. E. et al. Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed population leads to greater accuracy of across-breed genomic predictions. Genet. Sel. Evol. 47, 29 (2015).

12. 12.

Xiang, R. et al. Genome variants associated with RNA splicing variations in bovine are extensively shared between tissues. BMC Genomics 19, 521 (2018).

13. 13.

Xiang, R. et al. Quantifying the contribution of sequence variants with regulatory and evolutionary significance to 34 bovine complex traits. Proc. Natl Acad. Sci. 116, 19398–19408 (2019).

14. 14.

MacLeod, I. et al. Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits. BMC Genom. 17, 144 (2016).

15. 15.

Littlejohn, M. D. et al. Sequence-based association analysis reveals an MGST1 eQTL with pleiotropic effects on bovine milk composition. Sci. Rep. 6, 25376 (2016).

16. 16.

Pausch, H. et al. Meta-analysis of sequence-based association studies across three cattle breeds reveals 25 QTL for fat and protein percentages in milk at nucleotide resolution. BMC Genom. 18, 853 (2017).

17. 17.

Sharma, V. et al. A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nat. Commun. 9, 1215 (2018).

18. 18.

Bouwman, A. C. et al. Meta-analysis of genome-wide association studies for cattle stature identifies common genes that regulate body size in mammals. Nat. Genet. 50, 362 (2018).

19. 19.

Zheng, J. et al. Recent developments in Mendelian randomization studies. Curr. Epidemiol. Rep. 4, 330–345 (2017).

20. 20.

Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 9, 224 (2018).

21. 21.

Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44, 512–525 (2015).

22. 22.

McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016).

23. 23.

Gao, Y. et al. Species composition and environmental adaptation of indigenous Chinese cattle. Sci. Rep. 7, 16196 (2017).

24. 24.

Pareek, C. S. et al. Single nucleotide polymorphism discovery in bovine pituitary gland using RNA-seq technology. PLoS ONE 11, e0161370 (2016).

25. 25.

Maiorano, A. M. et al. Assessing genetic architecture and signatures of selection of dual purpose Gir cattle populations using genomic information. PLoS ONE 13, e0200694 (2018).

26. 26.

Kemper, K. et al. Leveraging genetically simple traits to identify small-effect variants for complex phenotypes. BMC Genom. 17, 858 (2016).

27. 27.

Ibeagha-Awemu, E. M., Peters, S. O., Akwanji, K. A., Imumorin, I. G. & Zhao, X. High density genome wide genotyping-by-sequencing and association identifies common and low frequency SNPs, and novel candidate genes influencing cow milk traits. Sci. Rep. 6, 31109 (2016).

28. 28.

Hunt, L. E. et al. Complete re-sequencing of a 2Mb topological domain encompassing the FTO/IRXB genes identifies a novel obesity-associated region upstream of IRX5. Genome Med. 7, 126 (2015).

29. 29.

Harmston, N. et al. Topologically associating domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation. Nat. Commun. 8, 441 (2017).

30. 30.

Stratigopoulos, G. et al. Hypomorphism of Fto and Rpgrip1l causes obesity in mice. J. Clin. Investig. 126, 1897–1910 (2016).

31. 31.

Xiang, R., Oddy, V. H., Archibald, A. L., Vercoe, P. E. & Dalrymple, B. P. Epithelial, metabolic and innate immunity transcriptomic signatures differentiating the rumen from other sheep and mammalian gastrointestinal tract tissues. PeerJ 4, e1762 (2016).

32. 32.

Flint, J. & Eskin, E. Genome-wide association studies in mice. Nat. Rev. Genet. 13, 807 (2012).

33. 33.

Fuchsberger, C., Abecasis, G. R. & Hinds, D. A. minimac2: faster genotype imputation. Bioinformatics 31, 782–784 (2014).

34. 34.

Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955 (2012).

35. 35.

Daetwyler, H. D. et al. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat. Genet. 46, 858 (2014).

36. 36.

Loh, P.-R. et al. Reference-based phasing using the haplotype reference consortium panel. Nat. Genet. 48, 1443 (2016).

37. 37.

Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).

38. 38.

Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).

39. 39.

Pons, P. & Latapy, M. in International Symposium on Computer and Information Sciences. 284–293 (Springer, 2005).

40. 40.

Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJ. Complex Syst. 1695, 1–9 (2006).

41. 41.

Shen, L. GeneOverlap: An R package to test and visualize gene overlaps. R Package (2014).

42. 42.

Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).

43. 43.

Grant, J. R., Arantes, A. S., Liao, X. & Stothard, P. In-depth annotation of SNPs arising from resequencing projects using NGS-SNP. Bioinformatics 27, 2300–2301 (2011).

44. 44.

Lee, S. H. & Van der Werf, J. H. MTG2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics 32, 1420–1422 (2016).

45. 45.

Hayes, B. J. & Daetwyler, H. D. 1000 Bull genomes project to map simple and complex genetic traits in cattle: applications and outcomes. Annu. Rev. Anim. Biosci. 7, 89–102, https://doi.org/10.1146/annurev-animal-020518-115024 (2019).

## Acknowledgements

Australian Research Council’s Discovery Projects (DP160101056) supported R.X. and M.E.G. DairyBio (a joint venture project between Agriculture Victoria and Dairy Australia) funded computing resources used in the analysis. I.B. was supported by the Center for Genomic Selection in Animals and Plants (GenSAP) funded by Innovation Fund Denmark (grant 0603-00519B). No funding bodies participated in the design of the study nor analysis, or interpretation of data nor in writing the manuscript. DataGene and CRV (www.crv4all-international.com/) provided access to data used in this study. We thank Gert Nieuwhof, Kon Konstantinov and Timothy P. Hancock (DataGene) and Chris Schrooten (CRV) for preparation and provision of data. We thank partners from the 1000-bull genome project for the data access. We thank Dr. Mekonnen Haile-Mariam for deriving the deregressed phenotypes from international MACE and Dr. Sunduimijid Bolormaa for sequence variant data imputation.

## Author information

Authors

### Contributions

M.E.G. and R.X. conceived the study. R.X. and I.B. implemented the analysis. I.M.M. and H.D.D. provided data and assisted with study design. R.X., I.B. and M.E.G. analysed the data. R.X. developed the EDME software. R.X. and M.E.G. wrote the paper. R.X., M.E.G., I.B., I.M.M. and H.D.D revised the paper. All authors read and approved the final paper.

### Corresponding author

Correspondence to Ruidong Xiang.

## Ethics declarations

### Competing interests

The authors declare no conflict of interest.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Xiang, R., van den Berg, I., MacLeod, I.M. et al. Effect direction meta-analysis of GWAS identifies extreme, prevalent and shared pleiotropy in a large mammal. Commun Biol 3, 88 (2020). https://doi.org/10.1038/s42003-020-0823-6

• Accepted:

• Published:

• ### Improving Genomic Prediction of Crossbred and Purebred Dairy Cattle

• Majid Khansefid
• , Michael E. Goddard
• , Mekonnen Haile-Mariam
• , Kon V. Konstantinov
• , Chris Schrooten
• , Gerben de Jong
• , Erica G. Jewell
• , Erin O’Connor
• , Jennie E. Pryce
• , Hans D. Daetwyler
•  & Iona M. MacLeod

Frontiers in Genetics (2020)

• ### Meta-analysis for milk fat and protein percentage using imputed sequence variant genotypes in 94,321 cattle from eight cattle breeds

• Irene van den Berg
• , Ruidong Xiang
• , Janez Jenko
• , Hubert Pausch
• , Mekki Boussaha
• , Chris Schrooten
• , Thierry Tribout
• , Arne B. Gjuvsland
• , Didier Boichard
• , Øyvind Nordbø
• , Marie-Pierre Sanchez
•  & Mike E. Goddard

Genetics Selection Evolution (2020)