Phenome-wide association studies (PheWAS) have been proposed as a possible aid in drug development through elucidating mechanisms of action, identifying alternative indications, or predicting adverse drug events (ADEs). Here, we select 25 single nucleotide polymorphisms (SNPs) linked through genome-wide association studies (GWAS) to 19 candidate drug targets for common disease indications. We interrogate these SNPs by PheWAS in four large cohorts with extensive health information (23andMe, UK Biobank, FINRISK, CHOP) for association with 1683 binary endpoints in up to 697,815 individuals and conduct meta-analyses for 145 mapped disease endpoints. Our analyses replicate 75% of known GWAS associations (P < 0.05) and identify nine study-wide significant novel associations (of 71 with FDR < 0.1). We describe associations that may predict ADEs, e.g., acne, high cholesterol, gout, and gallstones with rs738409 (p.I148M) in PNPLA3 and asthma with rs1990760 (p.T946A) in IFIH1. Our results demonstrate PheWAS as a powerful addition to the toolkit for drug discovery.
The discovery and development of novel therapeutics is difficult. It may take 15 years to advance a new molecular entity from therapeutic hypothesis to approval, with development costs in the billion dollar range and only a 10% chance of a new drug tested in humans eventually getting approval1. Two reasons stand out to explain the high failure rate of clinical trials and receding return on R&D investment across the pharmaceutical industry: a lower efficacy of the compound in the targeted disease population than anticipated from preclinical studies; and the occurrence of unintended drug effects, particularly mechanism-based adverse drug events (ADEs) uncovered only in late-stage clinical trials2. A greater understanding of human data relevant to the drug target at early stages of drug development is generally considered to increase the probability of success1,3,4.
Resources that systematically capture biomedical information on vast numbers of individuals are revolutionizing our ability to understand the complexities of human biology and morbidity. Electronic health records (EHRs) and other resources that systematically capture extensive health information have rapidly become well-established tools for epidemiological and post-marketing research5,6. Recently, a surge of initiatives has sought to link such phenotype resources with genome-scale genetic data in order to gain further insights into the genetics of common diseases7,8,9,10,11,12,13,14.
An attractive approach to help accelerate drug development utilizing these genotype–phenotype resources is through applying phenome-wide association studies (PheWAS). PheWAS are an unbiased approach to test for associations between a specific genetic variant, or, more recently, combination of variants, and a wide range of phenotypes in large numbers of individuals7,15,16. By exploring the associations of a genetic variant that impacts the function of a drug target gene, PheWAS in disease-agnostic cohorts with extensive health information may enrich the drug discovery process for five reasons: (1) association studies in disease-agnostic cohorts may validate target-disease links in cohorts that more closely resemble the real world, i.e., the patients that will ultimately receive a drug;17 (2) by unraveling pleiotropy, PheWAS may improve our understanding of the biological functions of a target, or hint at concealed pathophysiological connections between disease entities previously considered as distinct;18,19 (3) PheWAS may reveal opportunities for drug repurposing, an attractive alternative to de novo drug development;20,21 (4) PheWAS may point to phenotypes that associate with an inverse directionality of target function, thus unraveling potential ADEs at very early stages of a developmental program, minimizing risks to trial participants, and helping define the most appropriate patient populations to benefit from a drug;21 and (5) through quantitative estimates from genetic safety and efficacy profiles, PheWAS may help prioritize multiple possible targets by identifying the target with the most promising therapeutic window. Despite these benefits, the ability of PheWAS to add substantially to decision making in drug development is thwarted by the difficulty of obtaining and systematizing comprehensive genotypes and phenotypes across very large numbers of individuals.
Here, we test the hypothesis that PheWAS can inform target validation at early stages of drug discovery. We select candidate drug targets across a range of therapeutic indications based on their support from genome-wide association studies (GWAS). To maximize power, we map a large spectrum of clinical endpoints from four of the world’s largest disease-agnostic cohorts with extensive health information (23andMe, UK Biobank interim release, FINRISK and CHOP) and conduct association testing in up to 697,815 individuals. We validate the top associations in the extended UK Biobank cohort (337,199 participants), and apply conditional analyses and co-localization methods to identify true pleiotropy predicting drug efficacy or safety signals. Our results show that PheWAS, despite limitations, enrich drug discovery with valuable information.
Assessing pleiotropy of SNPs near 19 candidate drug targets
In this study, we queried the literature for genes nominated through GWAS as putatively causally linked to the risk for common complex human diseases and supported by various degrees of additional genetic or biological evidence. We selected 19 genes that, based on previously described genetic associations with either immune-mediated (9 genes: ATG16L1, CARD9, CD226, CDHR3, GPR35, GPR65, IFIH1, IRF5, and TYK2), cardiometabolic (8 genes: F11, F12, GDF15, GUCY1A3, KNG1, LGALS3, PNPLA3, and SLC30A8), or neurodegenerative diseases (2 genes: LRRK2, TMEM175), were evaluated as potential novel drug targets (Table 1). Gene-disease associations had been established through 25 common lead single nucleotide polymorphisms (SNPs) that all reached a conservative level of statistical significance (P < 5 × 10−8) for association in GWAS with at least one phenotype of relevance to drug discovery and development (Supplementary Table 1). All of these SNPs have either been demonstrated to impact the target gene in functional studies (genetic evidence), or are located proximal to a gene implicated in a biological mechanism related to the GWAS phenotype (biological evidence). Our selection ranged from targets with little biological knowledge beyond GWAS nomination (e.g., TMEM175 for Parkinson’s disease (PD)) to targets with drug candidates in early clinical trials (e.g., F11 for thromboembolism). Details on the genetic and biological support for all selected genes and SNPs are provided in Supplementary Methods.
To broadly investigate pleiotropic effects of the 25 chosen SNPs in a maximal number of individuals, we interrogated four large disease-agnostic cohorts that link genome-wide genotype data from individuals of European ancestry with extensive phenotypic data: the 23andMe Inc. cohort with self-reported phenotypes on 671,151 research participants22, the interim UK Biobank cohort analyzed by Genomics plc with questionnaire-based health information on 112,337 participants (from the first genetic data release in May 2015)10, and two EHR-based cohorts from an adult Finnish cohort (FINRISK; 21,371 participants)23 and from a pediatric healthcare population from the Children’s Hospital of Philadelphia (CHOP; 12,044 patients)24 (Table 2 and Methods). All four cohorts contributed phenotypic data in different formats (medical interviews, self-reports, WHO ICD codes, or ICD9-CM codes) in both shared and distinct phenotype categories (Fig. 1a). Manual phenotype mapping identified 145 distinct clinical endpoints that were tested in two or more cohorts in up to 697,815 individuals (Fig. 1b, Supplementary Table 2, and Supplementary Table 3). As illustrated in Fig. 1c, these 145 mapped phenotypes represent a broad spectrum of disease categories and, as typically observed in disease-agnostic cohorts, show significant variability in the case:control ratios, both within and between cohorts. In addition, PheWAS in the four cohorts provided association results for 1538 cohort-specific unmapped endpoints, leading to a total of 1683 endpoints included in our analysis. Association testing in the cohorts was performed using logistic regression models; meta-analyses were performed using fixed effect models (see Methods for details).
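As an illustration of the fixed-effect meta-analysis step, the following minimal sketch pools per-cohort log odds ratios by inverse-variance weighting. It is not the study's pipeline: the function name `fixed_effect_meta` and the example effect sizes are hypothetical, and the sketch assumes that each cohort reports a log(OR) and its standard error from logistic regression.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect pooling of log odds ratios.

    betas: per-cohort log(OR) estimates; ses: their standard errors.
    Returns (pooled log(OR), pooled SE, Z score, two-sided P value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p

# Hypothetical per-cohort results for one SNP-endpoint pair (not study data).
cohort_betas = [math.log(1.05), math.log(1.03), math.log(1.08)]
cohort_ses = [0.010, 0.025, 0.060]
beta, se, z, p = fixed_effect_meta(cohort_betas, cohort_ses)
print(f"OR_meta = {math.exp(beta):.3f}, P_meta = {p:.2e}")
```

Because weights scale with 1/SE², the largest cohort dominates the pooled estimate, which is one reason heterogeneity between cohorts deserves a separate check.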
Meta-PheWAS replicate known GWAS signals
We first evaluated whether association testing in the four disease-agnostic cohorts replicated established results from published GWAS. GWAS had associated the 25 tested SNPs with genome-wide significance to 58 binary disease endpoints. Of these, 47 endpoints were ascertained with adequate power (1 − β ≥ 0.8) to reach P < 0.05 in the PheWAS meta-analysis. After excluding the three Parkinson’s disease associations that were derived from 23andMe data in the published GWAS, we observed that 33 of the 44 (75%) powered GWAS associations replicated at P < 0.05 in our PheWAS meta-analysis with consistent directions of effects (18/27 (67%) powered GWAS associations replicated at FDR < 0.1 (P < 3.8 × 10−4)) (Supplementary Figs 1, 2, and Supplementary Table 4). The overlap between the published GWAS effect sizes and the confidence intervals observed in the meta-PheWAS and in the four cohorts is provided in Supplementary Figs 2 and 3. As expected from data obtained in real-world settings, the replication rate of known associations was highly disease-dependent (Supplementary Fig. 1B). For instance, out of the 11 associations that failed to replicate despite sufficient case numbers in the cohorts, eight were associations with inflammatory bowel disease (IBD), Crohn’s disease (CD), or ulcerative colitis (UC), likely reflecting suboptimal ascertainment of these endpoints in real-world settings. Nonetheless, the high replication rate of previously reported associations demonstrates the power of combining disease-agnostic cohorts from various sources to detect and validate true SNP-disease associations, and to substantiate therapeutic hypotheses.
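The power criterion can be made concrete with a short calculation. The sketch below is an assumption about how such a check could be done, not the method used in this study: it approximates the two-sided power of a Wald test given a published odds ratio and the standard error expected in the replication data. The function name `replication_power` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def replication_power(odds_ratio, se_log_or, alpha=0.05):
    """Approximate two-sided power of a Wald test for a true odds ratio.

    se_log_or is the standard error of log(OR) expected in the
    replication data set (an input the analyst must supply).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Expected Z score if the published effect size is the true effect.
    expected_z = abs(math.log(odds_ratio)) / se_log_or
    return norm.cdf(expected_z - z_crit) + norm.cdf(-expected_z - z_crit)

# Hypothetical example: published OR of 1.10, expected SE of 0.03.
power = replication_power(odds_ratio=1.10, se_log_or=0.03)
print(f"power = {power:.2f}, adequately powered: {power >= 0.8}")
```

Under this approximation an endpoint counts as adequately powered when the returned value is at least 0.8 at alpha = 0.05.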
Meta-PheWAS identify novel SNP-phenotype associations
We next investigated whether meta-PheWAS across the four cohorts could identify novel associations to support the proposed clinical indication(s) (derived from established genetic associations, see Table 1), suggest alternative indications for drug repositioning, or uncover potential target-related ADEs. To improve statistical power in this analysis, the PheWAS results in the four cohorts were meta-analyzed together with summary statistics from published GWAS of 34 diseases available from a larger database assembled and harmonized by Genomics plc (referred to as Genomics plc GWAS, Supplementary Note 1). Overall, 27,763 association tests (across 145 harmonized and 1538 cohort-specific endpoints) resulted in nine putative novel associations reaching study-wide significance after Bonferroni correction (P < 1.8 × 10−6) (Table 3). Using a less stringent significance threshold of FDR < 0.1 (P < 7 × 10−4) previously applied in PheWAS25, we identified 71 distinct putative novel associations (Fig. 2, Supplementary Table 5 and Supplementary Data 1). Of these, 30 were with mapped phenotypes and were obtained from meta-analyzing results from at least two cohorts, and 41 were supported by a single cohort (and thus require independent replication) (Supplementary Table 5). Forty-three of these putative novel associations showed the same directions of effect as disease endpoints related to the proposed clinical indication for a drug and may hint at potential repositioning opportunities (Supplementary Fig. 4). Conversely, 27 showed directions of effect opposite to disease endpoints related to the proposed clinical indication and may suggest safety signals that could endanger therapeutic success and warrant monitoring in preclinical models and clinical trials (Supplementary Fig. 4).
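The two significance thresholds can be illustrated with a short sketch. The Bonferroni value is consistent with dividing 0.05 by the 27,763 tests reported above. The FDR part uses the Benjamini-Hochberg step-up procedure, which is an assumption here because the text does not name the FDR method, and the example P values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P value threshold controlling the family-wise error rate."""
    return alpha / n_tests

def benjamini_hochberg(pvalues, q):
    """Indices of tests rejected at FDR level q (BH step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose P value is below its BH critical value.
        if pvalues[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# 27,763 tests at alpha = 0.05 gives about 1.8e-06.
print(f"Bonferroni threshold: {bonferroni_threshold(0.05, 27763):.1e}")

# Hypothetical P values for ten association tests (not study data).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
print("Rejected at FDR < 0.05:", benjamini_hochberg(example_p, q=0.05))
```

The FDR criterion admits more associations than Bonferroni at the cost of an expected fraction of false positives among them, which is why single-cohort findings at this level are flagged as requiring replication.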
The 30 novel associations with mapped phenotypes showed limited evidence of heterogeneity between the PheWAS cohorts (Supplementary Fig. 5). Twenty-three (77%) of these 30 associations showed an I2 < 40%. Manual review of the results showed that only one of the seven associations with I2 > 40%, the GDF15 rs17724992 association with high blood pressure, was less significant in the meta-analysis than in the individual cohorts (P23andMe = 6.4 × 10−10, OR23andMe = 0.96; PGplc/UK Biobank = 0.58, ORGplc/UK Biobank = 0.99; Pmeta = 7.6 × 10−9, ORmeta = 0.97) (Supplementary Fig. 5B).
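For reference, the I2 statistic can be derived from Cochran's Q as sketched below. This is the standard definition, given as a minimal illustration rather than the study's code. The function name `cochran_q_i2` and the example inputs are hypothetical, and the inputs are assumed to be per-cohort log odds ratios with standard errors.

```python
def cochran_q_i2(betas, ses):
    """Cochran's Q and I2 (in percent) for per-cohort effect estimates."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 is the share of variation beyond chance, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Hypothetical log(OR) estimates from two cohorts (not study data).
q, i2 = cochran_q_i2(betas=[-0.040, -0.010], ses=[0.007, 0.018])
print(f"Q = {q:.2f}, I2 = {i2:.0f}%")
```

Values of I2 below 40% are commonly read as limited heterogeneity, which matches the cutoff used above.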
Replication of novel associations in UK Biobank v2
Forty-one of the 71 potential novel associations reaching FDR < 0.1, including eight of the nine novel associations reaching study-wide significance, were with phenotypes tested by Neale et al. through GWAS in the expanded UK Biobank (v2) cohort of up to 337,199 participants of European ancestry. In an attempt to replicate putative novel associations discovered in our meta-PheWAS, we performed weighted Z score-based meta-analyses between the 23andMe, FINRISK and CHOP PheWAS results, the published GWAS results and the UK Biobank v2 results (excluding the Gplc UK Biobank results). Out of the 41 putative novel associations, 16 showed P < 0.05 in UK Biobank v2 with consistent direction of effect, thus validating and further strengthening significance of our previous results (Supplementary Table 6). An additional seven potential novel associations showed increased significance in meta-analysis despite P > 0.05 in UK Biobank v2, largely due to small number of cases and lack of statistical power in UK Biobank v2 alone. Overall, meta-analysis with UK Biobank v2 strengthened all eight novel associations with study-wide significance after Bonferroni correction and 23/41 (56%) of the potential novel associations with FDR < 0.1, including eight associations that were based on results from a single PheWAS cohort. Strengthened associations in the meta-analysis with UK Biobank v2 include the rs17724992-high blood pressure association that showed significant heterogeneity between the 23andMe and the interim UK Biobank cohorts (P23andMe = 6.4 × 10−10, OR23andMe = 0.96; PUK Biobank v2 = 4.4 × 10−5; Pmeta_v2 = 3.9 × 10−13).
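A weighted Z score meta-analysis can be sketched as follows. The text does not state the weights, so the square root of each study's sample size is used here as an assumption (a common convention for sample-size-weighted schemes). The function name `weighted_z_meta` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def weighted_z_meta(pvalues, directions, sample_sizes):
    """Combine two-sided P values into a weighted Z score and P value.

    directions: +1 or -1 per study, the sign of the effect on a common allele.
    sample_sizes: per-study sample sizes; weights are sqrt(N) (an assumption).
    """
    norm = NormalDist()
    # Convert each two-sided P value to a Z score carrying the effect direction.
    zs = [math.copysign(-norm.inv_cdf(p / 2.0), d)
          for p, d in zip(pvalues, directions)]
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, zs)) / math.sqrt(
        sum(w ** 2 for w in weights))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical inputs: two studies with concordant protective effects.
z, p = weighted_z_meta(pvalues=[1e-6, 0.02], directions=[-1, -1],
                       sample_sizes=[500000, 300000])
print(f"Z_meta = {z:.2f}, P_meta = {p:.1e}")
```

This approach needs only P values, effect directions, and sample sizes, which makes it convenient when studies report effects on different scales. For case-control endpoints an effective sample size is often substituted for N.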
Interpretation of apparent pleiotropy in PheWAS results
A challenge to the PheWAS approach is to reliably distinguish true pleiotropic associations of a SNP (or SNPs in strong LD with the lead SNP), suggesting a shared causal mechanism, from unrelated associations driven by independent SNPs at a locus18. For instance, in our study, the putative association of rs2274273 near LGALS3 (encoding the galectin-3 protein) with PD (OR23andMe = 0.94, P23andMe = 1 × 10−4) likely reflects a distinct causal mechanism previously attributed to GCH126. rs2274273 is a protein quantitative trait locus (pQTL) that controls plasma levels of galectin-327. Through a Bayesian test for co-localization using summary statistics from published GWAS26,28,29, we excluded rs2274273 as a causal SNP for PD (posterior probability for a shared variant leading the PD and galectin-3 levels associations = 0.0008%) (Supplementary Fig. 6).
A second challenge to PheWAS is the existence of common co-morbidities among endpoints, or alternatively an insufficient distinction between phenotypes19. In our meta-PheWAS, rs17724992 near GDF15 showed association with multiple cardiovascular-related phenotypes, which is likely mediated by the known association of this SNP with body mass index (BMI)30, an established risk factor for cardiovascular disease31. This is supported by the lack of association of rs17724992 with blood pressure (PSBP = 0.064, PDBP = 0.134) and coronary artery disease (CAD, P = 0.17) in the large GWASs published by the International Consortium for Blood Pressure and the CARDIoGRAMplusC4D consortium32,33. Phenotype correlation scores can indicate apparent pleiotropic effects that may be explained by comorbidities or confounding (Supplementary Fig. 7), yet follow-up customized association analyses adjusting for specific phenotypic covariates are required to distinguish true pleiotropic effects and inform target validation.
In summary, these two examples demonstrate that thorough investigation of association results can reduce biases introduced through PheWAS.
Meta-PheWAS reveal pleiotropic effects of PNPLA3 rs738409
Among the nine study-wide significant associations, our meta-PheWAS revealed multiple novel associations for the PNPLA3 missense SNP rs738409 (p.I148M). The rs738409-G allele has previously been reported as associated with an increased risk for non-alcoholic fatty liver disease (NAFLD), alcohol-related cirrhosis and hepatic steatosis, as well as elevated alanine aminotransferase (ALT) levels, most likely through a gain-of-function (GOF) mechanism (Supplementary Methods). Consistent with these findings, our meta-PheWAS found rs738409-G to be associated with elevated liver tests (OR23andMe = 1.25, P23andMe = 4 × 10−45) (Supplementary Fig. 8). Beyond that, our analysis also indicated that carriers of the rs738409-G allele that increases ALT are more prone to develop liver toxicities when treated with nonsteroidal anti-inflammatory drugs (NSAIDs) such as ibuprofen (OR23andMe = 1.43, P23andMe = 4.6 × 10−5) or aspirin (OR23andMe = 1.57, P23andMe = 5.3 × 10−5). It also confirmed the association of rs738409-G with increased risk of T2D (ORmeta = 1.08, Pmeta = 8 × 10−11) recently reported in a T2D fine-mapping study that confirmed rs738409 as the most likely causal SNP34. Our meta-PheWAS further revealed significant associations between rs738409-G and decreased risk for high cholesterol (ORmeta = 0.96, Pmeta = 1.6 × 10−7; Pmeta_v2 = 1.1 × 10−8) and the intake of cholesterol-lowering medications (OR23andMe = 0.97, P23andMe = 2 × 10−4; Pmeta_v2 = 2.8 × 10−5), consistent with recent results from the lipids exome chip study describing a significant association of rs738409-G with decreased LDL levels35. In addition, the meta-PheWAS revealed novel significant associations between the rs738409-G GOF allele and decreased risk for acne (OR23andMe = 0.90, P23andMe = 1.5 × 10−11; Pmeta_v2 = 7.3 × 10−12), gout (ORmeta = 0.92, Pmeta = 4.1 × 10−5; Pmeta_v2 = 3.9 × 10−9), and gallstones (ORmeta = 0.95, Pmeta = 2.7 × 10−4; Pmeta_v2 = 1.5 × 10−5). 
All these associations remained prominent after adjusting for elevated liver tests (Supplementary Table 7), and were further strengthened in the meta-analysis with the expanded UK Biobank cohort (Supplementary Table 6). Taken together, our PheWAS results support the hypothesis that therapeutic inhibition of PNPLA3 could treat liver diseases. They also support T2D as a potential alternative indication for PNPLA3 inhibition. However, concomitant inverse associations with multiple other endpoints, including acne and high plasma cholesterol levels, indicate potential clinically relevant on-target ADEs that should be considered in decisions to progress PNPLA3 inhibitors toward clinical development.
IFIH1 partial loss-of-function increases asthma risk
The meta-PheWAS further revealed novel, important pleiotropic effects for drugs directed toward IFIH1. Carriers of the IFIH1 (encoding MDA5) rs1990760-C allele (MAF = 40%) have an established lower risk for several autoimmune diseases (type 1 diabetes, T1D; vitiligo; systemic lupus erythematosus, SLE; psoriasis) and an increased risk for UC (Supplementary Methods). Functional studies suggest that rs1990760-C (p.T946A) causes IFIH1 loss-of-function (LOF), and additional IFIH1 LOF alleles have been shown to protect against T1D, vitiligo, psoriasis, and psoriatic arthritis (PsA) (Supplementary Methods). Our meta-PheWAS support these associations (Fig. 2 and Supplementary Table 4). Beyond this, we found a significant novel association between rs1990760-C and increased risk for asthma (ORmeta = 1.04, Pmeta = 6.7 × 10−8) that reached Pmeta_v2 = 2 × 10−8 in the meta-analysis with the expanded UK Biobank cohort (Fig. 3a and Supplementary Table 6). The association between rs1990760 and asthma was supported by data from all four disease-agnostic cohorts as well as the GABRIEL and EVE asthma GWAS cohorts36,37, despite lack of power to detect an association with rs1990760 in the published GWAS cohorts alone (Fig. 3b). This association remained significant after adjustment for autoimmune diseases in the 23andMe cohort, demonstrating that the asthma association is independent of the previously established associations of rs1990760 with autoimmunity (Supplementary Table 8). Co-localization analysis confirmed that the same SNP was responsible for the SLE, UC, and asthma associations at the locus, supporting true pleiotropic effects driven by the same causal variant(s) (Fig. 3c). 
The observed IFIH1 pleiotropic effects were further strengthened by the observation in the Genomics plc UK Biobank data that the independent low-frequency IFIH1 missense allele p.I923V (rs35667974-C, MAF = 1.8%), previously reported to result in IFIH1 LOF and to protect against T1D, vitiligo, psoriasis, and PsA, and to increase risk of UC, was also associated with increased risk of asthma (ORGplc/UK Biobank = 1.18, PGplc/UK Biobank = 1.1 × 10−4) (Fig. 3d). Together, these and previous findings establish IFIH1 as a gene with an allelic series38 and further support the therapeutic hypothesis that inhibition of MDA5 may protect against several autoimmune diseases. However, our results also reveal the potential of MDA5 inhibitors to cause pulmonary ADEs and strengthen previous findings for an increased risk for colitis-related symptoms, endpoints that may limit the therapeutic window of MDA5 modulators and should be considered for monitoring in clinical trials.
PheWAS assist target prioritization for thromboembolism
Beyond informing on individual genes, we hypothesized that PheWAS might help prioritize targets among several candidates within a biological pathway. Factors XI, XII, and plasma kininogen (encoded by KNG1) are members of the contact activation coagulation pathway39. Anti-coagulation therapies directed against these factors are hypothesized to have improved therapeutic windows over current standard-of-care, which is accompanied by significant bleeding liabilities40. With the aim to estimate genetic risk–benefit profiles for the three candidate targets, we chose to interrogate three uncorrelated SNPs at the F11, KNG1, and F12 loci. These three SNPs had similar allele frequencies in Europeans, had previously been shown to impact FXI, FXII, and/or KNG1 mRNA and/or protein levels, and are associated with activated partial thromboplastin time (aPTT), a biomarker of blood clotting, or venous thromboembolism (VTE) risk (Supplementary Methods and Supplementary Table 1). Carriers of the rs4253399-T allele, which reduces circulating FXI levels and increases aPTT, showed an expected lower risk for blood clots (ORmeta = 0.84, Pmeta = 3.5 × 10−25)41, but no evidence for association with bleeding tendency (OR23andMe = 1.04, P23andMe = 0.35) (Fig. 4). In contrast, carriers of the KNG1 allele rs5030062-A, which reduces plasma kininogen as well as circulating FXI, and increases aPTT, showed both reduced blood clotting (ORmeta = 0.93, Pmeta = 1.6 × 10−4) as well as increased bleeding liability (OR23andMe = 1.14, P23andMe = 4.1 × 10−4). A nominal association with both phenotypes was found in carriers of the FXII levels-reducing and aPTT-increasing allele rs2731672-T (blood clots: OR23andMe = 0.96, P23andMe = 0.034; bleeding tendency: OR23andMe = 1.09, P23andMe = 0.039).
By comparing these results with the effect of the three SNPs on aPTT (Supplementary Table 1), our study suggests that, among the three factors tested, targeting FXI may yield the best compromise between thromboembolism risk reduction and increased bleeding liability, which is consistent with the outcomes of a recent phase 2 clinical trial42.
Our study investigates the utility of PheWAS to help predict therapeutic success of candidate drug targets nominated through human genetics. We focused on a selection of loci that GWAS have firmly established as associated with common immune-mediated, cardiometabolic, or neurodegenerative human diseases, and where additional biological or genetic evidence supports candidate drug target genes within these loci as likely causing the disease associations. We analyzed SNPs impacting these targets for association with 1683 disease endpoints captured in four large, disease-agnostic population cohorts that link genome-wide genotypes with various types of structured health information. Our PheWAS meta-analysis replicates 75% of the published GWAS associations at P < 0.05, substantially surpassing performance of previous PheWAS in smaller cohorts25. Through meta-analyzing PheWAS results with published GWAS data, we identified nine novel SNP-phenotype associations that exceeded stringent significance thresholds for multiple test correction, as well as additional putative associations with therapeutically relevant clinical endpoints. For a subset of early drug targets, our results support previous genetic evidence for efficacy in distinct common disease indications. Our analysis further proposes alternative indications as opportunities for drug repositioning and predicts on-target adverse drug events that may warrant preclinical or clinical monitoring.
Among others, we discovered novel associations for p.I148M in PNPLA3. This is a common gain-of-function missense allele increasing the risk for a range of liver phenotypes, which suggested that pharmaceutical inhibition of PNPLA3 could be a viable strategy to treat or prevent liver diseases. While our PheWAS support this hypothesis and further back expanding the indication spectrum of a putative PNPLA3 inhibitor to T2D, we also uncovered opposite associations with severe acne and high cholesterol, phenotypes that, if observed during a clinical trial, might put a therapeutic program at risk.
We also identified a novel association of the IFIH1 loss-of-function allele rs1990760-C (p.T946A) with risk of asthma. The rs1990760-C allele, which protects against several autoimmune diseases and increases risk of UC, has been shown to decrease interferon (IFN) signaling and lower resistance to viral challenge43, while complete loss of IFIH1 function makes children susceptible to severe viral respiratory infections44,45. The association of rs1990760-C with increased risk of asthma discovered in our meta-PheWAS is consistent with the observation that bronchial epithelial cells from asthmatics produce lower amounts of IFN-β during viral infections46, a finding that led to inhaled IFN-β being tested in phase 2 clinical trials for the treatment of virus-induced asthma exacerbation47. Future studies will need to investigate the risk:benefit ratio of modulating MDA5 (encoded by IFIH1) for asthma relative to autoimmune diseases.
While our study illustrates the power of systematically interrogating disease-agnostic cohorts with extensive health information to enrich target validation, it also emphasizes several opportunities to improve existing resources in order for PheWAS to become a routine tool in drug discovery and development. First, truly large, thoroughly phenotyped cohorts will be needed to adequately power PheWAS. Despite our meta-PheWAS being conducted in close to 700,000 individuals, 20% of GWAS associations could not be replicated (P < 0.05) in the disease-agnostic cohorts due to an insufficient number of cases. In addition, PheWAS should gain considerably from improved phenotypic endpoints48. In our study, this is best reflected by only a modest replication rate, despite adequate power, for CD, UC, and IBD endpoints that are closely related and difficult to discern from other disorders in routine clinical settings49. To better take these considerations and other characteristics of disease-agnostic cohorts (typical case:control ratio imbalance between phenotypes and phenotype correlation) into account, novel statistical methods will be needed to better define significance thresholds and control type I error rates in PheWAS50. Second, our study highlights the challenge of systematically combining phenotypes from independent disease-agnostic cohorts with various phenotype data sources. While we introduce the concept of meta-PheWAS and demonstrate that mapping phenotypes to interrogate independent PheWAS cohorts may considerably strengthen association signals, there is still a need for standardized terminology, automated phenotype extraction, and coordinated data management across healthcare institutions that will help with better harmonization across cohorts in the future9,51. A third challenge to the PheWAS approach is inherent to the current limitations of human genetics.
Even when starting from a highly annotated set of loci as in our study, PheWAS may lead to spurious interpretation of association results that can only be ruled out through thorough follow-up18. We demonstrate this with the example of LGALS3 and PD. Access to genome-wide association results for systematic fine-mapping and co-localization analyses, functionalization of GWAS loci and the emergence of association data for intermediate phenotypes, e.g., at the protein level, will be needed to help narrow the gap between SNPs and candidate target genes in the future. Finally, a fourth challenge to the broad use of PheWAS for drug development is to relate findings from germline variants that impact a target across an individual’s entire lifetime to the success of an interventional trial with much shorter observation periods. In the end, many decisions to pursue or discontinue a therapeutic program may remain dependent on the specific risk:benefit ratio that quantitative genetics as applied here may help to predict, and the level of unmet clinical need.
Taken together, our study highlights PheWAS as a highly promising, yet largely untapped opportunity to use disease-agnostic cohorts with extensive health information for drug target validation. We provide several examples that illustrate PheWAS as a powerful strategy to help predict efficacy and unintended drug effects, which should ultimately help to develop better drugs. Whether PheWAS may truly impact decision making during drug development will only become evident with either the emergence of ADEs in trials that genetics could have predicted, or reduced safety-related attrition rates for portfolios enriched in targets nominated through human genetics. The growing number of large-scale population cohorts that link genetic data with extensive health data, together with an increased willingness across the borders of academia, biotech and the pharmaceutical industry to collaborate and share data, will provide opportunities to demonstrate that.
In this study, we selected 25 SNPs that were significantly associated (P < 5 × 10−8) in published GWAS with binary or quantitative phenotypes related to three main therapeutic areas: (auto)immune, cardiometabolic, or neurodegenerative diseases (Supplementary Methods). These 25 SNPs had either been functionally validated in published studies, establishing the candidate target gene as causal for the risk of disease, or they were located within or near genes (as defined by the regions encompassing all SNPs in r2 > 0.5 to the GWAS index SNPs extended to the nearest recombination hot spots) for which previous studies had generated convincing biological evidence to be of relevance for the respective clinical endpoint. The 25 SNPs were linked to 19 genes that were evaluated as candidate drug targets. Detailed information on the SNPs, candidate causal genes and their link to common human disease is provided in Supplementary Methods. The list of SNPs and their known associated phenotypes is provided in Supplementary Table 1.
We interrogated four large observational disease-agnostic cohorts of subjects of European ancestry with genome-wide genotyped data linked to extensive phenotypic information (Table 2). All participants included in each of the four cohorts were unrelated individuals of European ancestry. Individual-level data from each cohort was analyzed independently, and the relevant summary statistics for the 25 SNPs were shared for further analysis. We restricted all cohorts to binary disease phenotypes with at least 20 cases per cohort. All endpoints were derived from questionnaires or ICD codes (including endpoints like high cholesterol or high blood pressure). No quantitative laboratory measurements were included in the study.
The 23andMe cohort comprised up to 671,151 participants and 654 binary disease endpoints derived from questionnaire-based self-reports22. Participants were restricted to a set of individuals who have > 97% European ancestry, as determined through an analysis of local ancestry using a support vector machine (SVM) and a hidden Markov model (HMM) to assign individuals to one of 31 reference populations. For each phenotype, we chose a maximal set of unrelated individuals using a segmental identity-by-descent (IBD) estimation algorithm. We defined individuals as related if they shared > 700 cM IBD on either one or both of their chromosomes. SNPs with Hardy–Weinberg equilibrium P < 10−20, call rate < 95%, or strong allele frequency deviation from European 1000 Genomes reference data were excluded. Participant genotype data were then imputed against the September 2013 release of 1000 Genomes Phase1 reference haplotypes52, using an internally developed phasing tool, Finch, which implements the Beagle haplotype graph-based phasing algorithm53, and Minimac254.
The Genomics plc analysis of UK Biobank cohort (referred to as ‘Genomics plc UK Biobank’) comprised 112,337 participants and 90 binary disease endpoints derived from questionnaire-based self-reports and medical interviews10. GWAS analyses were performed by Genomics plc using the interim data release (May 2015). QC followed the recommendations provided by UK Biobank. European ethnicity was defined as self-reported white British ethnic background, and confirmed by principal component analysis clustering. Samples with relatives (3rd degree or closer) were excluded. Imputation was carried out by the UK Biobank data providers using SHAPEIT355, IMPUTE356, and a reference panel combining the 1000 Genomes Phase 357 and UK10K datasets58.
FINRISK is a collection of cross-sectional population surveys carried out since 1972 to assess the risk factors of chronic diseases and health behavior in the working age population of Finland23. The FINRISK cohort comprised 21,371 Finnish participants and 269 binary disease endpoints derived from ICD code groupings in Finnish national hospital registries and the cause-of-death registry, and from drug reimbursement and purchase registries. The FINRISK samples were genotyped using Illumina CoreExome, OmniExpress, and 610K chips. After a gender check, samples with genotype missing rate > 5% or excess heterozygosity (> 4SD) were excluded. SNP QC, including exclusion of SNPs with genotype missing rate > 2%, minor allele frequency <1%, or Hardy–Weinberg equilibrium P value <1 × 10−6, was performed for each genotyping chip separately. Multidimensional scaling (MDS) components were estimated with PLINK v1.959 from the LD-pruned genotype data where relatives with pi-hat > 0.2 had been removed. Samples with non-Finnish ancestry observed as MDS outliers were removed. Imputation was performed with SHAPEIT55 and IMPUTE256 using a reference panel combining information from the 1000 Genomes phase 357 and 1941 Finnish SiSu whole genome sequences60. Imputation was stratified based on genotyping chip.
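The per-SNP exclusion criteria described for FINRISK can be sketched as follows. This is a minimal illustration only: the function names and the genotype-count inputs are hypothetical, and the Hardy–Weinberg check uses a simple 1-df chi-square approximation as an assumption, whereas many genetics tools apply an exact test instead.

```python
import math


def hwe_chisq_p(n_aa, n_ab, n_bb):
    """Approximate Hardy-Weinberg P value from genotype counts (1-df chi-square)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(|Z| > sqrt(chi2))
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_snp_qc(n_aa, n_ab, n_bb, n_missing,
                  max_missing=0.02, min_maf=0.01, min_hwe_p=1e-6):
    """Apply missingness, MAF and HWE thresholds (defaults follow the FINRISK text)."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    missing_rate = n_missing / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    return (missing_rate <= max_missing
            and maf >= min_maf
            and hwe_chisq_p(n_aa, n_ab, n_bb) >= min_hwe_p)
```

The thresholds are exposed as parameters because the four cohorts applied different cut-offs.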
The cohort from the Children’s Hospital of Philadelphia (CHOP) comprised 12,044 pediatric patients and 870 binary disease endpoints derived from ICD9–CM codes using the ICD9-to-PheWAS codes mapping described by Denny et al.24,61. Subjects included in the CHOP PheWAS were genotyped on one of the following genotyping chips following the Illumina standard protocols: Illumina Human610-Quad version 1, Illumina 550K SNP array, or Illumina OmniExpress array. Samples with genotype call rate > 95% were included in the study. SNPs with genotype missing rate > 5%, minor allele frequency <1%, and Hardy–Weinberg equilibrium P value < 0.00001 were excluded. Principal component analysis (PCA) was performed using EIGENSTRAT62 on ∼130,000 SNPs that had been pruned for linkage disequilibrium using PLINK v1.0759 and reference genotypes from the HapMap consortium63. Imputation was performed with SHAPEIT v255 and IMPUTE256 using the 1000 Genomes project phase 1 reference panel52. SNPs with INFO scores < 0.9 were excluded.
All the participants in the 23andMe, Genomics plc UK Biobank, FINRISK, and CHOP cohorts provided written informed consent for participating in research studies. Blood or saliva samples were collected according to protocols approved by local institutional review boards. Details are provided in the original publications describing the cohorts10,22−24. This research has been conducted using the UK Biobank resource under the Genomics plc project application number 9659.
In addition, with the aim of replicating novel associations identified in the four disease-agnostic cohorts, we interrogated genome-wide summary statistics from 57 published GWAS, including 34 binary disease phenotypes, derived from a larger database that has been assembled and harmonized by Genomics plc (referred to as ‘Genomics plc GWAS’). The full list of studies in the Genomics plc GWAS database that were tested in this study is available in Supplementary Note 1. Harmonization included checks to ensure consistency of the data, and alignment of alleles to the forward strand of the human reference sequence, with effects ascribed to the alternative allele. Effect size estimates for quantitative traits were rescaled relative to the residual variance. Summary-statistic imputation was applied to infer association evidence at common variants (minor allele frequency > 2%) in the 1000 Genomes EUR reference panel. Results for SNPs associated with the relevant phenotype with P < 0.05 were included in the meta-analysis.
Correlation between all GWAS was estimated to ensure that no GWAS included in the meta-analysis for a given phenotype had overlapping samples. In addition, to further prevent GWAS results with overlapping samples from being meta-analyzed, only the most recent/largest study for a given disease was included in our analysis when several GWAS in the Genomics plc database investigated the same disease. Although we could not directly estimate potential sample overlap between the different disease-agnostic cohorts, significant overlap is very unlikely based on the participants’ characteristics (Table 2).
Identification of shared phenotypes
The phenotypic endpoints tested in the 23andMe, UK Biobank, FINRISK, and CHOP cohorts were derived from different sources (self-reports, self-reports and medical interviews, WHO ICD codes, and ICD9-CM codes, respectively) and named using different nomenclatures (e.g., clinical terms versus popular terms, abbreviations versus full names). In order to compare and combine results from the four cohorts with published GWAS results from the Genomics plc database, we manually mapped the phenotypes. Examples of mapped and unmapped phenotypic endpoints are provided in Supplementary Table 2. This step allowed us to identify 145 distinct phenotypes shared by at least 2 cohorts and with at least 20 cases in the independent cohorts (Fig. 1). The full list of mapped phenotypes is provided in Supplementary Table 3 and Supplementary Data 1. We note that, in each cohort, some phenotypes were captured multiple times by different endpoints with slightly different definitions. In such cases, only one endpoint per cohort was selected for meta-analysis.
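The selection of shared phenotypes after manual mapping can be illustrated with a short sketch. The record layout and the function name are hypothetical. The rule for picking one endpoint per cohort is not specified above, so keeping the endpoint with the most cases is purely an assumption made for illustration.

```python
from collections import defaultdict


def shared_phenotypes(endpoints, min_cohorts=2, min_cases=20):
    """Return mapped phenotypes observed in at least `min_cohorts` cohorts.

    `endpoints` is an iterable of (cohort, endpoint_name, mapped_phenotype, n_cases).
    The result maps phenotype -> {cohort: (endpoint_name, n_cases)}.
    """
    per_phenotype = defaultdict(dict)
    for cohort, endpoint, phenotype, n_cases in endpoints:
        if n_cases < min_cases:
            continue  # endpoints with too few cases are not analyzed
        current = per_phenotype[phenotype].get(cohort)
        # Assumption: keep the endpoint with the most cases within a cohort
        if current is None or n_cases > current[1]:
            per_phenotype[phenotype][cohort] = (endpoint, n_cases)
    return {
        phenotype: cohorts
        for phenotype, cohorts in per_phenotype.items()
        if len(cohorts) >= min_cohorts
    }
```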
PheWAS and meta-analysis
Phenome-wide association analyses for each of the 25 SNPs were conducted in the 23andMe, Genomics plc UK Biobank, FINRISK (PheWAS results release November 2016), and CHOP cohorts separately. Each SNP-phenotype association was tested independently (assuming an additive genetic model), using logistic regression adjusted for age, gender, and principal components to account for population stratification. Genotyping batch and survey cohort were also included as covariates in the FINRISK PheWAS. We then performed two distinct analyses to (1) replicate known GWAS associations, and (2) detect novel associations.
First, we meta-analyzed PheWAS results from the four cohorts, to investigate the ability of these cohorts to replicate known GWAS associations. After harmonizing the effect alleles across the cohorts, fixed-effect meta-analyses were performed using PLINK59. The I2 statistic and manual review of the meta-analyzed results were used to evaluate heterogeneity between cohorts.
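For orientation, a fixed-effect meta-analysis with an I2 heterogeneity estimate can be sketched as below. This is not the PLINK implementation; it is a minimal inverse-variance weighted version that assumes per-cohort log odds ratios and standard errors are available after allele harmonization. The function name and return layout are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    `betas` are per-cohort log odds ratios for the same effect allele,
    `ses` their standard errors.
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    # Cochran's Q and the I2 statistic (percentage of variation due to heterogeneity)
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"beta": beta, "se": se, "z": z, "p": p, "q": q, "i2": i2}
```

A high I2 would flag a SNP-phenotype pair for the kind of manual review mentioned above.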
We then compared the meta-analysis association results with known significant SNP-phenotype associations from published GWAS, taking into account the statistical power to detect an association in the meta-analysis of the PheWAS results in the disease-agnostic cohorts.
Second, we meta-analyzed results from the four disease-agnostic cohorts together with available GWAS results in order to detect novel associations. Meta-analysis was performed using PLINK as described above. Meta-analysis results at the 145 shared phenotypes were then combined with cohort-specific phenotype results from the 25 SNPs, resulting in 27,762 tests in total. It is clear from the structure of this PheWAS and meta-PheWAS that the 27,762 tests are not independent, which requires thought about the most appropriate method to control for multiple testing. We chose two methods: one that provides an over-conservative correction assuming independence (Bonferroni correction), and a less conservative one that has been shown to be robust to test dependency (Benjamini & Hochberg’s False Discovery Rate (FDR))64. Benjamini and Yekutieli (2001) showed that the FDR procedure is robust to positive correlation among tests65; we therefore used the standard Benjamini & Hochberg FDR procedure implemented in the p.adjust function in R. To define significance in this study, we set an FDR threshold of 0.1, which corresponded to P < 7 × 10−4. The over-conservative significance threshold based on Bonferroni correction was P = 0.05/27,762 = 1.8 × 10−6. We note that Bonferroni correction ignores the correlation structure between the tested phenotypes and the fact that all the SNPs tested in this study are known to be associated with one or several phenotypes in published GWAS.
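The two corrections can be illustrated with a short Python sketch. The study itself used p.adjust in R; the code below is only an assumed equivalent of the standard Benjamini & Hochberg step-up adjustment, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def bh_adjust(pvalues):
    """Benjamini & Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    # Visit P values from largest to smallest, enforcing monotonicity
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Bonferroni threshold for the 27,762 tests reported in the text
threshold = bonferroni_threshold(0.05, 27762)
```

An association would be declared significant at FDR < 0.1 when its adjusted value falls below 0.1.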
Meta-analysis with UK Biobank v2 association results
To further test the robustness of the putative novel associations identified in our study, we performed a meta-analysis of the 23andMe, FINRISK, CHOP, and published GWAS results for 41 SNP-phenotype pairs with association results released by Neale et al. from an analysis of the expanded UK Biobank cohort, consisting of up to 337,199 unrelated participants of European ancestry (referred to as UK Biobank v2). In order to meta-analyze these UK Biobank v2 results, which had been obtained using linear regression models, with the PheWAS cohorts and GWAS results of the current study, which were obtained using logistic regression models, we performed a weighted Z score meta-analysis. For each SNP-phenotype pair in each study i, we defined weights using the following equation:
where Nai and Nui are the numbers of cases and controls in study i, respectively.
For each SNP-phenotype pair, we then calculated the meta-analysis Z score as follows:
where Zi is the Z score in study i, derived from the logistic or linear regression model.
The UK Biobank GWAS results used in this analysis have been released by the Neale lab at the following URL: https://sites.google.com/broadinstitute.org/ukbbgwasresults/home?authuser=0.
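The equations themselves are not reproduced in this text. As a hedged illustration, the sketch below assumes the widely used effective-sample-size weighting, w_i = sqrt(4 / (1/N_ai + 1/N_ui)), combined as Z = sum(w_i * Z_i) / sqrt(sum(w_i^2)). Both formulas and the function names are assumptions and may differ from the exact expressions used in the study.

```python
import math


def effective_weight(n_cases, n_controls):
    """Assumed weight: square root of the effective sample size of a case-control study."""
    return math.sqrt(4.0 / (1.0 / n_cases + 1.0 / n_controls))


def weighted_z_meta(z_scores, n_cases, n_controls):
    """Weighted Z-score meta-analysis across studies.

    Returns the combined Z score and its two-sided normal P value.
    """
    weights = [effective_weight(a, u) for a, u in zip(n_cases, n_controls)]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z = numerator / denominator
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

Because only Z scores and sample sizes enter the calculation, this approach allows results from linear and logistic regression models to be combined without a common effect-size scale.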
Statistical power estimations
We estimated statistical power to detect an association with known associated phenotypes using a formula adapted from Yang et al.66, based on the published effect size in the most recently published GWAS, the frequency of the associated SNP risk allele in the 1000 Genomes EUR population, the number of cases and controls in the disease-agnostic cohorts, and the following phenotype prevalences reported by the Centers for Disease Control and Prevention (https://www.cdc.gov): coronary artery disease, 5.8%; Crohn’s disease, 0.2%; inflammatory bowel disease, 0.44%; myocardial infarction, 3%; multiple sclerosis, 0.09%; primary biliary cirrhosis, 0.04%; Parkinson’s disease, 0.07%; psoriasis, 3%; rheumatoid arthritis, 0.6%; systemic lupus erythematosus, 0.2%; systemic scleroderma, 0.02%; type 1 diabetes, 0.5%; type 2 diabetes, 9%; ulcerative colitis, 0.24%; venous thromboembolism, 0.4%; vitiligo, 1%.
To distinguish true pleiotropic effects from multiple associations at the loci that are explained by different causal SNPs (and potentially incriminating different causal genes), we used association summary statistics available from published GWAS and applied a Bayesian test implemented in the R package ‘coloc’ to assess co-localization, i.e., the probability of sharing causal genetic variants between pairs of apparent pleiotropic phenotypes using association summary statistics at the loci of interest28. Co-localization analysis at the LGALS3 locus was performed using meta-analyzed PD GWAS summary statistics from 23andMe published elsewhere (N cases = 4127, N controls = 62,037)26, and galectin-3 plasma pQTL results in 3562 blood donors29. Co-localization analysis at the IFIH1 locus was performed using meta-analyzed SLE GWAS results from two independent published studies67,68, meta-analyzed asthma GWAS summary statistics from 23andMe69 and the Genomics plc UK Biobank (unpublished), and published UC GWAS summary statistics70.
Full results from the meta-analysis of the 23andMe, Genomics plc UK Biobank, FINRISK, and CHOP cohorts with published GWAS results are provided in Supplementary Data 1. All summary statistics from PheWAS in the individual cohorts can be requested from the respective authors.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank the research participants from the 23andMe, UK Biobank, FINRISK, and CHOP cohorts for their contributions to this study. We would like to thank Jyoti Shah, Jennifer Pai, Mark Sharp, Hongjie Sun, and Ian Wallace for their input on phenotype mapping. We also thank Benjamin Neale’s lab for generating and sharing the GWAS summary statistics from the expanded UK Biobank cohort, and Benjamin Neale for his input on meta-analyzing these results with standard disease association results derived from logistic regression. M.M. and H.H. are supported by a sponsored Research Grant from Merck, the Institutional Development Fund from CHOP and the eMERGE consortium grant 1U01HG006830-01 from the NHGRI.