We performed a scan for genetic variants associated with multiple phenotypes by comparing large genome-wide association studies (GWAS) of 42 traits or diseases. We identified 341 loci (at a false discovery rate of 10%) associated with multiple traits. Several loci are associated with multiple phenotypes; for example, a nonsynonymous variant in the zinc transporter SLC39A8 influences seven of the traits, including risk of schizophrenia (rs13107325: log-transformed odds ratio (log OR) = 0.15, P = 2 × 10^−12) and Parkinson disease (log OR = −0.15, P = 1.6 × 10^−7), among others. Second, we used these loci to identify traits that have multiple genetic causes in common. For example, variants associated with increased risk of schizophrenia also tended to be associated with increased risk of inflammatory bowel disease. Finally, we developed a method to identify pairs of traits that show evidence of a causal relationship. For example, we show evidence that increased body mass index causally increases triglyceride levels.
The observation that a genetic variant affects multiple phenotypes (a phenomenon often called 'pleiotropy' (refs. 1,2,3), although we will not use this term) is informative in a number of applications. One such application is learning about the molecular function of a gene. For example, men with cystic fibrosis (primarily known as a lung disease) are often infertile because of congenital absence of the vas deferens; this is evidence of a shared role for the CFTR protein in lung function and the development of reproductive organs4. Another application is learning about the causal relationships between traits. For example, individuals with congenital hypercholesterolemia also have elevated risk of heart disease5; this is now interpreted as evidence that changes in lipid levels causally influence heart disease risk6.
In these two applications, the same observation—that a genetic variant influences two traits—is interpreted in fundamentally different ways depending on known aspects of biology. In the first case, a genetic variant influences two phenotypes through independent physiological mechanisms (graphically, P1 ← G → P2, if G represents the genotype, P1 the first phenotype, and P2 the second phenotype and the arrows represent causal relationships7), whereas, in the second case, the effect of the variant on the second trait is mediated through its effect on the first trait, G → P1 → P2. In some situations, knowing which interpretation of the observation to prefer is simple: for example, it seems difficult to imagine how the reproductive and lung phenotypes of a CFTR mutation could be related in a causal chain. In other situations, interpretation is considerably more challenging. For example, the causal connections between various lipid phenotypes and heart disease have been debated for decades (for example, see ref. 8).
As the number of reliable associations between genetic variants and various phenotypes has grown over the last decade9, these issues have received increasing attention. A number of recent studies have identified genetic variants associated with multiple traits10,11,12,13,14,15,16,17,18,19,20; in general, these associations are interpreted as most plausibly due to the independent effects of a genetic variant on different aspects of physiology. For example, a genetic variant in LGR4 is associated with bone mineral density (BMD), age at menarche, and risk of gallbladder cancer16, presumably owing to effects mediated through different tissues.
There has also been increasing interest in the alternative, causal framework for interpreting genetic variants that influence multiple phenotypes, which has been formalized under the name 'Mendelian randomization' (refs. 21,22,23). Mendelian randomization has been used to provide evidence for (or against) a causal role for various clinical variables in disease etiology24,25,26,27,28,29,30. For example, genetic variants associated with body mass index (BMI) are also associated with type 2 diabetes27; this is consistent with a causal role for weight gain in the etiology of diabetes.
Thus far, most studies of multiple traits have been performed across the genome on groups of traits already known or hypothesized to be related10,31,32,33 or via testing small sets of variants for effects on a wide range of traits20,34. We aimed to systematically perform a genome-wide search for genetic variants that influence pairs of traits and then to interpret these associations in light of the causal and non-causal models described above. In this paper, we describe the results of such a search using large GWAS of 42 traits.
We assembled summary statistics from 43 GWAS of 42 traits or diseases performed in individuals of European descent (Table 1; 2 of these GWAS were for age at menarche). These studies span a wide range of phenotypes, from anthropometric traits (for example, height, BMI, and nose size) to neurological disease (for example, Alzheimer disease and Parkinson disease) to susceptibility to infection (for example, childhood ear infections and tonsillectomy). Seventeen of these GWAS were performed by the personal genomics company 23andMe and have not previously been reported (for details of these studies, see Supplementary Data 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17). For studies that were not done using imputation to all variants in Phase 1 of the 1000 Genomes Project35, we performed imputation at the level of summary statistics with ImpG v1.0 (ref. 36). We estimated the approximate number of independent associated variants (at a false discovery rate (FDR) of 10%) in each study using fgwas v.0.3.6 (ref. 37). The number of associations ranged from around 5 (for age at voice drop in men) to over 500 (for height).
Identification of genetic variants that influence pairs of traits
We first aimed to identify genetic variants that influence pairs of traits. To do this, we developed a statistical model (extending that used by Giambartolomei et al.38) to estimate the probability that a given genomic region (i) contains a genetic variant that influences the first trait (model 1); (ii) contains a genetic variant that influences the second trait (model 2); (iii) contains a genetic variant that influences both traits (model 3); or (iv) contains both a genetic variant that influences the first trait and a separate genetic variant that influences the second trait (model 4) (Fig. 1). The input to the model is the set of summary statistics (effect size estimates and standard errors) for each SNP in the genome on each of the two phenotypes, and (if the two GWAS were performed on overlapping sets of individuals) the expected correlation in the summary statistics due to correlation between the phenotypes. We can then fit the following log likelihood function
$$ l(\Theta \mid D) = \sum_{i=1}^{M} \ln\left( \Pi_0 + \sum_{j=1}^{4} \Pi_j \, \mathrm{RBF}_i^{(j)} \right), $$
where D is the data, M is the number of approximately independent blocks in the genome, Π0 is the prior probability that a region contains no genetic variants that influence either trait, Π1, Π2, Π3, and Π4 represent the prior probabilities of the four models described above, Θ is the set of all five Π parameters, and RBF_i^(j) is the regional Bayes factor measuring the support for model j in genomic region i (see the Supplementary Note for details). In the presence of missing data, we consider only the subset of SNPs with data in both studies; if the causal SNP is not present, this acts to reduce power to detect a shared effect38. In fitting this model, we estimate the prior parameters and the posterior probability of each model for each region of the genome (for numerical stability, in practice, we penalize the estimates of the prior parameters and so obtain maximum a posteriori estimates). We were mainly interested in the estimated prior probability that each genomic region contains a variant that influences both traits (the estimate of Π3) and the corresponding posterior probabilities for each genomic region.
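To make the roles of the priors and the regional Bayes factors concrete, the following minimal sketch evaluates a log likelihood and per-region posterior probabilities. It assumes the likelihood is a sum over regions of the log of Π0 plus the prior-weighted regional Bayes factors, which matches the definitions above. The function names and toy numbers are hypothetical, and the sketch does not attempt the penalized fitting of the priors described in the Supplementary Note.

```python
import math

def log_likelihood(priors, regional_bfs):
    """Mixture log likelihood over genomic regions.

    priors: [pi0, pi1, pi2, pi3, pi4], assumed to sum to 1.
    regional_bfs: one [RBF1, RBF2, RBF3, RBF4] list per region.
    """
    pi0, model_priors = priors[0], priors[1:]
    total = 0.0
    for bfs in regional_bfs:
        total += math.log(pi0 + sum(p * b for p, b in zip(model_priors, bfs)))
    return total

def region_posteriors(priors, bfs):
    """Posterior probability of the null model and models 1-4 for one region."""
    weights = [priors[0]] + [p * b for p, b in zip(priors[1:], bfs)]
    norm = sum(weights)
    return [w / norm for w in weights]

# Toy values (illustrative only, not taken from the paper).
priors = [0.90, 0.03, 0.03, 0.03, 0.01]
regions = [
    [1.0, 1.0, 1.0, 1.0],      # no evidence for any model
    [0.5, 0.5, 2000.0, 10.0],  # strong support for a shared variant (model 3)
]
print(log_likelihood(priors, regions))
print(region_posteriors(priors, regions[1]))
```

In this toy input the second region has a model 3 posterior above 0.9, the kind of threshold used later to call shared loci.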
Several caveats of this method are worth mentioning. First, note that the estimate is best thought of as the proportion of genomic regions that detectably influence both traits—if one study is small and underpowered, this estimate will necessarily be zero. This approach contrasts with methods that aim to provide unbiased estimates of the 'genetic correlation' between traits, which do not depend on sample size39,40,41. Second, in general, it is not possible to distinguish a single causal variant that influences both traits (model 3 in Fig. 1) from two separate causal variants (model 4 in Fig. 1) in the presence of strong linkage disequilibrium (LD) between the causal variants. For any individual genomic region discussed below, the possibility of two highly correlated causal variants must be considered as an alternative possibility in the absence of functional follow-up. (Indeed, this latter possibility appears to be common in quantitative trait locus studies performed in model organisms42.) Finally, we evaluated the method in simulations (Supplementary Figs. 1–5) and found that the model gives a small overestimate of the proportion of shared effects (Supplementary Fig. 3). This is because the amount of evidence against the null model of no associations is greater when a variant influences both phenotypes as compared to when it only influences a single phenotype (Supplementary Fig. 4).
Overlapping association signals identified in 43 GWAS
We applied the method to all pairs of the 43 GWAS listed in Table 1. For each pair of studies, we first estimated the expected correlation in the effect sizes from the summary statistics and included this correction for overlapping individuals in the model. Note that this is conservative: in pairs of GWAS where we are sure that there are no overlapping individuals (for example, age at menarche and age at voice drop), we saw that the correlation in the summary statistics was nonzero, indicating that we are correcting out some truly shared genetic effects on the two traits (Supplementary Fig. 6).
To gain an exploratory sense of the relationships between the phenotypes, we examined the patterns of overlap in associations among all 43 studies. Specifically, the model can be used to estimate, for each pair of traits [i,j], the proportion of detected variants that influence trait i that also detectably influence trait j. These estimates are shown in Figure 2, with phenotypes clustered according to their patterns of overlap. We see several clusters of related traits. For example, of the variants that detectably influence age at menarche (in the study by Perry et al.43), the maximum a posteriori estimate is that 36% detectably influence height, 30% detectably influence age at voice drop, 28% influence BMI, 10% influence breast size, and 10% influence male-pattern baldness. We interpret this as a set of phenotypes that share hormonal regulation. Additionally, there is a large cluster of phenotypes including coronary artery disease (CAD), type 2 diabetes, red blood cell traits, and lipid traits, which we interpret as a set of metabolic traits. Further, immune-related disease (allergies, asthma, hypothyroidism, Crohn's disease, and rheumatoid arthritis) all cluster together and also cluster with infectious disease traits (childhood ear infections and tonsillectomy). This biologically relevant clustering validates the principle that GWAS variants can identify shared mechanisms underlying pairs of traits in a systematic way. As a control, we performed the same clustering of phenotypes by the estimated proportion of genomic regions where two causal sites fall nearby (model 4 in Fig. 1). In this case, there was no biologically meaningful clustering (Supplementary Fig. 7).
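The asymmetric overlap proportions described above can be illustrated with a short sketch. It assumes that, for a pair of traits, the proportion of variants detectably influencing the first trait that also influence the second is Π3 / (Π1 + Π3), and symmetrically Π3 / (Π2 + Π3) for the second trait. This formula is our reading of the description, not something the text states, and the text does not say whether model 4 regions enter the denominator. The function name and numbers are hypothetical.

```python
def overlap_proportions(pi1, pi2, pi3):
    """Return (share of trait-1 variants also hitting trait 2,
               share of trait-2 variants also hitting trait 1).

    Assumes the proportion is pi3 / (pi_k + pi3); this is an
    illustrative assumption, not a formula quoted from the paper.
    """
    def share(pi_single):
        denom = pi_single + pi3
        return pi3 / denom if denom > 0 else float("nan")
    return share(pi1), share(pi2)

# Toy prior estimates for one pair of traits (illustrative only).
first_to_second, second_to_first = overlap_proportions(0.032, 0.005, 0.018)
print(first_to_second, second_to_first)
```

The two values differ, which is why a matrix of such proportions is not symmetric.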
Individual loci that influence many traits
We next examined the individual loci identified by these pairwise GWAS. We identified 341 genomic regions where we infer the presence of a variant that influences a pair of traits, at a threshold of a posterior probability greater than 0.9 of model 3 (Supplementary Table 1). This number excludes 'trivial' findings where a genetic variant influences two similar traits (two lipid traits, two red blood cell traits, two platelet traits, both measures of BMD, both inflammatory bowel diseases, or type 2 diabetes and fasting glucose) and the MHC region. A previous 'phenome-wide association study' identified 44 genetic variants associated with multiple phenotypes34, so this represents an order of magnitude increase in the number of such loci.
Some genomic regions contain variants that influence a large number of the traits we considered. We ranked each genomic region according to how many phenotypes share genetic associations in the region (that is, if the pairwise scan for both height and CAD and the pairwise scan for CAD and LDL both indicated the same region, we counted this as three phenotypes sharing an association in the region). The top region in this ranking identified a nonsynonymous polymorphism in SH2B3 (rs3184504) that is associated with a number of autoimmune diseases, lipid traits, heart disease, and red blood cell traits (Supplementary Fig. 8 and Supplementary Table 2). This variant has been identified in many GWAS, particularly for autoimmune diseases44.
The next region in this ranking contains the gene encoding the ABO histo-blood groups in humans and has a variant associated with 11 traits in these data (and many other additional traits not in these data; see also refs. 20,45,46,47). In Figure 3a, we show the association statistics in this region for CAD and probability of having a tonsillectomy. At the lead SNP, the non-reference allele is associated with increased risk of CAD (z = 5.7, P = 1.1 × 10^−8) and increased risk of having a tonsillectomy (z = 6.0, P = 1.5 × 10^−9). This variant is also strongly associated with other immune, red blood cell, and lipid traits in these data (Fig. 3b). A tag for a microsatellite that influences the expression of ABO48 is correlated with the lead SNP rs635634, as is a tag for the O blood group (Fig. 3a). However, the lead SNP is an expression quantitative trait locus (eQTL) for both ABO and the nearby gene SLC2A6 in whole blood46, so this allele may in fact have downstream effects via effects on the expression of two genes.
Among the top ranked regions were several where the likely causal variant is known: (i) a nonsynonymous variant in the zinc transporter SLC39A8 (rs13107325; Supplementary Fig. 9) that is associated with schizophrenia (log OR for the non-reference allele = 0.15, P = 2 × 10^−12), Parkinson disease (log OR = −0.15, P = 1.6 × 10^−7), and height (β = −0.03 s.d., P = 3.8 × 10^−7), among others; (ii) a nonsynonymous variant in the glucokinase regulator GCKR (rs1260326; Supplementary Fig. 10) that is associated with fasting glucose levels (β = 0.06 s.d., P = 5 × 10^−25) and height (β = 0.019 s.d., P = 2.6 × 10^−11), among others; (iii) a set of variants near the APOE gene (which we presume to be driven by the APOE4 allele; Supplementary Fig. 11) that is associated with nearsightedness (rs6857: log OR = −0.04, P = 1.8 × 10^−5), waist–hip ratio (β = −0.02 s.d., P = 8.3 × 10^−5), and several lipid traits apart from the well-known association with Alzheimer disease; and (iv) regulatory variants in an intron of the FTO gene49,50 that are associated with breast size in women (rs1421085: β = 0.06 s.d., P = 3.5 × 10^−7; Supplementary Fig. 12) and age at voice drop in men (β = −0.02 s.d., P = 2.7 × 10^−5), among others.
It has previously been observed that association signals for different phenotypes tend to cluster spatially in the genome51; these results suggest that, in some cases, clustered associations are driven by single variants. We note anecdotally that the variants that influence a large number of phenotypes often seem to be nonsynonymous rather than regulatory changes, which contrasts with the pattern seen in association studies overall (for example, see ref. 37).
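The ranking of regions by the number of phenotypes sharing an association can be sketched as a count of distinct traits per region across all pairwise scans. The input format (region label plus the two traits of the scan) and the function name are hypothetical; the counting rule follows the height/CAD/LDL example given in the text.

```python
from collections import defaultdict

def rank_regions(pairwise_hits):
    """Rank regions by the number of distinct traits implicated.

    pairwise_hits: iterable of (region, trait_a, trait_b) tuples, one per
    pairwise scan that flagged the region as containing a shared variant.
    Returns (region, trait_count) pairs, largest count first.
    """
    traits_by_region = defaultdict(set)
    for region, trait_a, trait_b in pairwise_hits:
        traits_by_region[region].update((trait_a, trait_b))
    counts = {region: len(traits) for region, traits in traits_by_region.items()}
    # Sort by descending count, then by region name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

hits = [
    ("region_A", "height", "CAD"),
    ("region_A", "CAD", "LDL"),
    ("region_B", "BMI", "triglycerides"),
]
print(rank_regions(hits))  # [('region_A', 3), ('region_B', 2)]
```

Using a set per region means a trait that appears in several pairwise scans is counted once, so the two scans in `region_A` yield three phenotypes rather than four.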
Identifying pairs of phenotypes with correlated effect sizes
In our scan for variants that influence pairs of phenotypes, we did not assume any relationship between the effect sizes of a variant on the two phenotypes. However, if two traits are influenced by shared underlying molecular mechanisms, we might expect the effects of a variant on the two phenotypes to be correlated. To test this hypothesis, we returned to the set of variants identified by analysis of each phenotype individually (the numbers of these variants for each trait are given in Table 1). For each set, we calculated the rank correlation between the effect sizes of the variants on the index trait (the one in which the variants were identified) and all of the other traits.
The results of this analysis are presented in Figure 4. Apart from closely related traits (for example, the two measurements of bone density), we saw a number of traits that were correlated at a genetic level. We focus on two of these. First, variants associated with delayed age of menarche in women tend, on average, to be associated with decreased BMI (ρ = −0.53, P = 1.2 × 10^−6), reduced risk of male-pattern baldness (ρ = −0.45, P = 5.9 × 10^−5), and increased height (ρ = 0.52, P = 2.2 × 10^−6; Fig. 4). These patterns held both for the GWAS on age at menarche performed by Perry et al.43 and that performed by 23andMe (Fig. 4). Most of these variants also delay age at voice drop in men (Fig. 2), so we interpret these variants as ones that influence pubertal timing in general. The negative correlation between a variant's effect on age at menarche and BMI has previously been observed39,43,52, as has the positive correlation between a variant's effect on age at menarche and height39,43. The negative correlation between a variant's effect on age at menarche (or, more likely, puberty in general) and male-pattern baldness has not been previously noted but is consistent with the known role for increased androgen signaling in causing hair loss53,54,55.
Second, we found that genetic variants associated with increased risk of schizophrenia tended to be associated with increased risk of both Crohn's disease (ρ = 0.27, P = 2.2 × 10^−4) and ulcerative colitis (ρ = 0.33, P = 6.6 × 10^−6). These correlations (identified only at the most strongly associated SNPs) are also present at the level of genome-wide genetic correlations between the diseases39 (Supplementary Fig. 13). This observation is consistent with slightly higher rates of autoimmune diseases (including Crohn's disease and ulcerative colitis) in patients with schizophrenia in Denmark56,57,58 and with molecular evidence for a partial autoimmune etiology for schizophrenia (for example, see ref. 59).
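The rank correlations (ρ) reported in this section are correlations between the ranks of a set of variants' effect sizes on an index trait and on another trait. The following sketch computes a Spearman rank correlation with average ranks for ties, using only the standard library. The function names and toy effect sizes are hypothetical, and no P value is computed.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy effect sizes of four variants on an index trait and on a second trait.
index_effects = [0.10, 0.08, 0.05, 0.12]
other_effects = [-0.03, -0.02, -0.01, -0.04]
print(spearman(index_effects, other_effects))
```

In the toy data, larger effects on the index trait always go with more negative effects on the second trait, so the rank correlation is at its minimum of −1.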
Inferring causal relationships between traits
Finally, we were interested in identifying pairs of traits that may be related in a causal manner. Because we are using observational data (rather than, for example, a randomized controlled trial), we view strong statements about causality as impossible. Nonetheless, a realistic goal might be to identify aspects of the data that are more consistent with a causal model than a non-causal model.
As a motivating example, we considered the correlation between levels of LDL cholesterol and risk of CAD, now widely accepted as a causal relationship60. We noticed that variants ascertained as having an effect on LDL cholesterol levels had correlated effects on risk of CAD (Figs. 4 and 5c), whereas variants ascertained as having an effect on CAD risk did not in general have correlated effects on LDL levels (Fig. 5d). This is consistent with the hypothesis that LDL cholesterol is one of many causal factors that influence CAD risk. An alternative interpretation is that LDL cholesterol is highly genetically correlated to an unobserved trait that causally influences risk of CAD.
We developed a method to detect pairs of traits that show this asymmetry in the effect sizes of associated variants, which we interpret as more consistent with a causal relationship between the traits than a non-causal one (Online Methods). At a threshold of a relative likelihood of 100 in favor of a causal versus a non-causal model, we identified five pairs of putative causally related traits. (At a less stringent threshold of a relative likelihood of 20 in favor of a causal model, we identified 11 additional pairs of traits (Supplementary Fig. 14).) Simulations suggest that this threshold corresponds approximately to a P value around 0.001 (Supplementary Fig. 15) and that the power of this test depends on the number of genetic variants used as input and the true underlying correlation in their effect sizes (Supplementary Fig. 16). Four of these are shown in Figure 5. First, genetic variants that influence BMI had correlated effects on triglyceride levels, whereas the reverse was not true; this suggests that increased BMI is a cause for increased triglyceride levels (Fig. 5). Randomized controlled trials of weight loss are also consistent with this causal link61,62, as are Mendelian randomization studies63,64. Second, we confirmed the evidence in favor of a causal role for increased LDL cholesterol levels in CAD (Fig. 5) and in favor of a causal role for increased BMI in type 2 diabetes risk (Fig. 5 and Supplementary Fig. 17). Finally, we suggest that increased risk of hypothyroidism causes decreased height (Fig. 5). Although it is known that severe hypothyroidism in childhood leads to decreased adult height (for example, see ref. 65), these data indicated that hypothyroidism susceptibility may also influence height in the general population. A fifth potentially causal relationship (between risk of CAD and rheumatoid arthritis) could not be confirmed in a larger study and so is not displayed (Supplementary Fig. 18 and Supplementary Note).
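The asymmetry that motivates this analysis can be illustrated with a deliberately crude sketch: compute the rank correlation of effect sizes separately for variants ascertained on each trait and compare the two. This is not the relative-likelihood method described in the Supplementary Note; it only shows the pattern that method is designed to detect. The function names and the toy effect sizes are hypothetical, and the rank formula assumes no tied values.

```python
def ranks(values):
    """Ranks 1..n; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def asymmetry(ascertained_on_trait1, ascertained_on_trait2):
    """Each argument is (effects_on_trait1, effects_on_trait2) for one variant set.

    Returns the rank correlation within each ascertainment set.
    """
    rho_1 = spearman_no_ties(*ascertained_on_trait1)
    rho_2 = spearman_no_ties(*ascertained_on_trait2)
    return rho_1, rho_2

# Toy data: variants found via trait 1 have correlated effects on trait 2,
# whereas variants found via trait 2 show little correlation with trait 1.
found_via_trait1 = ([0.1, 0.2, 0.3, 0.4, 0.5], [0.02, 0.05, 0.04, 0.09, 0.10])
found_via_trait2 = ([0.02, -0.02, 0.03, -0.01, 0.01], [0.1, 0.2, 0.3, 0.4, 0.5])
print(asymmetry(found_via_trait1, found_via_trait2))
```

A strong correlation in the first set and a weak one in the second is the pattern the text reads as more consistent with trait 1 influencing trait 2 than the reverse, subject to the caveats about unobserved correlated traits.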
We have performed a scan for genetic variants that influence multiple phenotypes and have identified several hundred loci that influence multiple traits. This style of scan complements methods to quantify the genetic correlation between two traits39,41,66,67, which are not generally concerned with identifying individual variants that influence both traits. We were interested in using the individual variants found to affect multiple traits to identify biological relationships between traits, including potential relationships where one trait is causally upstream of the other. Other potential mechanisms that could lead to an association between a genetic variant and two phenotypes include transgenerational effects for a variant, with one effect on a parental phenotype and an effect on a separate phenotype in the offspring (for example, see refs. 68,69), or assortative mating that involves more than one trait70.
A number of limitations of this study are worth mentioning. First, all of the GWAS we have used are based on genotyping arrays and imputation, and thus the loci identified are generally common (minor allele frequency over 1%). Inferences from common variants such as these may not hold for rarer variants that may emerge from large sequencing studies. Second, we reiterate that all of our inferences are based on sets of 'detectable' loci; the GWAS we have used have highly variable sample sizes, and the traits have variable genetic architectures. As sample sizes for all traits reach the millions, inferences from detectable loci will converge to inferences from all loci. If traits truly follow an infinitesimal model (where every genetic variant influences every trait), we speculate that patterns of genetic overlap (such as those in Fig. 2) will become less interpretable, while patterns of genetic correlation (such as those in Fig. 4) may be more useful.
One clear observation from these data is that genetic variants that influence puberty (age at menarche and age at voice drop) often have correlated effects on BMI, height, and male-pattern baldness (Fig. 4). In our scan for causal relationships between traits, we found modest evidence of a causal role of age at menarche in influencing adult height and for a causal role of BMI in the development of male-pattern baldness (Supplementary Fig. 14). The non-causal alternative (also consistent with the data) is that all of these traits are influenced by some of the same underlying biological pathways, and perhaps the most likely candidate for this pathway is hormonal signaling. This highlights the importance of considering evidence from multiple traits when interpreting the molecular consequences of a variant and designing experimental studies. Although variants that influence height overall are enriched near genes expressed in cartilage71 and variants that influence BMI are enriched near genes expressed broadly in the central nervous system72, it seems that a subset of these variants also influence age at menarche and male-pattern baldness. For these variants, it may be worth considering functional follow-up in gonadal tissues or specific brain regions known to be important in hormonal signaling.
It is also striking to note how many genetic variants influence multiple traits (Fig. 2) but without a consistent correlation in effect sizes (Fig. 4). For example, many of the autoimmune and immune-related traits appear to have many genetic causes in common, but the effect sizes of the variants on the different traits seem to be largely uncorrelated (see also refs. 10,39). Likewise, many variants appear to influence lipid traits, red blood cell traits, and immune traits, but without consistent directions of effect. A trivial explanation for this observation is that we are underpowered to detect correlations in effect sizes because we are using only a small set of the SNPs with the strongest associations. However, the genetic correlations between many of these traits (calculated using all SNPs) are not significantly different from zero39 (Supplementary Fig. 13). Another possibility is that a given genetic variant often influences the function of multiple cell types through separate molecular pathways or that the effects of a variant on two related phenotypes vary according to an individual's environmental exposures.
From the point of view of epidemiology, the ability to scan through many pairs of traits to find those that are potentially causally related seems appealing, and some previous analyses have had similar goals73. Our approach makes the key assumption that, if two traits are related in a causal manner, then the 'causal' trait is one of many factors that influence the 'caused' trait. This results in an asymmetry in the effects of genetic variants on the two traits that can be detected (Fig. 5). We also assume that we have identified a modest number of variants that influence both traits. This naturally means we are limited to considering heritable traits that have been studied within cohorts with moderate sample sizes (on the order of tens to hundreds of thousands of individuals). It seems likely that the main limiting factor to scaling this approach (should it be generally useful) will be phenotyping rather than genotyping.
The sources of the GWAS data analyzed in this study are described in detail in the Supplementary Note. For each study, we imputed summary statistics or genotypes for all autosomal variants in the March 2012 release of the 1000 Genomes Project Phase 1 (ref. 35). Our method uses the z scores and standard errors of the estimated effect sizes for each SNP. In studies where standard errors were not provided, we approximated them using the allele frequencies from the European-descent individuals in the 1000 Genomes Project Phase 1 release and the reported sample size of the study (see ref. 37). Throughout the paper, we report effect sizes of variants as the effect of the non-reference allele in human genome reference hg19.
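As an illustration of how a standard error might be approximated from an allele frequency and a sample size, the sketch below uses a common large-sample approximation for a quantitative trait scaled to unit variance: SE ≈ 1 / sqrt(2 N f (1 − f)). This particular formula is an assumption on our part and may differ from the approximation in ref. 37; it would not apply directly to case-control studies. The function names are hypothetical.

```python
import math

def approximate_standard_error(allele_frequency, sample_size):
    """Approximate SE of a per-allele effect on a unit-variance quantitative trait.

    Assumes SE ~ 1 / sqrt(2 * N * f * (1 - f)), which holds when the variant
    explains a small fraction of trait variance (illustrative assumption).
    """
    if not 0.0 < allele_frequency < 1.0:
        raise ValueError("allele frequency must lie strictly between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    genotype_variance = 2.0 * allele_frequency * (1.0 - allele_frequency)
    return 1.0 / math.sqrt(sample_size * genotype_variance)

def z_score(effect_size, standard_error):
    """z score of an effect estimate given its standard error."""
    return effect_size / standard_error

# Toy example: allele frequency 0.25 in a study of 100,000 individuals.
se = approximate_standard_error(0.25, 100_000)
print(se, z_score(0.03, se))
```

Because the approximation depends on f(1 − f), rarer alleles receive larger standard errors at a fixed sample size.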
The hierarchical model used for the main scan for overlapping association signals in two GWAS data sets is described in detail in the Supplementary Note. Software implementing the model is available through GitHub (see URLs).
We aimed to develop a robust method for measuring the evidence in favor of a causal relationship between two traits using data from many genetic associations, while recognizing that strong conclusions are likely impossible in this setting. The approach we developed is described in detail in the Supplementary Note.
This work was supported in part by the National Human Genome Research Institute of the National Institutes of Health (grant R44HG006981 to 23andMe) and the National Institute of Mental Health (grant R01MH106842 to J.K.P.). We thank the customers of 23andMe for making this work possible, the GWAS consortia that made summary statistics available to us, L. Jostins for providing updated summary statistics from the Crohn's disease and ulcerative colitis GWAS, and G. Coop and M. Stephens for helpful discussions. We thank D. Golan and J. Pritchard for comments on a previous version of this manuscript. We thank D. Cesarini and the Social Science Genetic Association Consortium for access to summary statistics from the association study of educational attainment.
Data on glycemic traits have been contributed by MAGIC investigators and have been downloaded from http://www.magicinvestigators.org/. Data on CAD and myocardial infarction have been contributed by CARDIoGRAMplusC4D investigators and have been downloaded from http://www.cardiogramplusc4d.org/.
We thank the International Genomics of Alzheimer's Project (IGAP) for providing summary results data for these analyses. The investigators within IGAP contributed to the design and implementation of IGAP and/or provided data but did not participate in analysis or writing of this report. IGAP was made possible by the generous participation of the control subjects, the patients, and their families. The iSelect chips were funded by the French National Foundation on Alzheimer disease and related disorders. EADI was supported by LABEX (Laboratory of Excellence program investment for the future) DISTALZ grant, INSERM, Institut Pasteur de Lille, Université de Lille 2, and the Lille University Hospital. GERAD was supported by the Medical Research Council (grant 503480), Alzheimer's Research UK (grant 503176), the Wellcome Trust (grant 082604/2/07/Z), and German Federal Ministry of Education and Research (BMBF): Competence Network Dementia (CND) grants 01GI0102, 01GI0711, and 01GI0420. CHARGE was partly supported by NIH/NIA grant R01 AG033193 and NIA grant AG081220 and AGES contract N01-AG-12100, NHLBI grant R01 HL105756, the Icelandic Heart Association, and the Erasmus Medical Center and Erasmus University. ADGC was supported by NIH/NIA grants U01 AG032984, U24 AG021886, and U01 AG016976, and by Alzheimer's Association grant ADGC-10-196728.
Genomic regions that contain a variant that influences more than one trait.