Genetic discovery from the multitude of phenotypes extractable from routine healthcare data can transform understanding of the human phenome and accelerate progress toward precision medicine. However, a critical question when analyzing high-dimensional and heterogeneous data is how best to interrogate increasingly specific subphenotypes while retaining statistical power to detect genetic associations. Here we develop and employ a new Bayesian analysis framework that exploits the hierarchical structure of diagnosis classifications to analyze genetic variants against UK Biobank disease phenotypes derived from self-reporting and hospital episode statistics. Our method displays a more than 20% increase in power to detect genetic effects over other approaches and identifies new associations between classical human leukocyte antigen (HLA) alleles and common immune-mediated diseases (IMDs). By applying the approach to genetic risk scores (GRSs), we show the extent of genetic sharing among IMDs and expose differences in disease perception or diagnosis with potential clinical implications.
This research has been conducted using the UK Biobank Resource (application number 10625). The research has been supported by the Wellcome Trust (095552/Z/11/Z to P.D., 100308/Z/12/Z to L.F., 100956/Z/13/Z to G.M., and 090532/Z/09/Z and 203141/Z/16/Z to the Wellcome Trust Centre for Human Genetics), the Danish National Research Foundation (grant number 126 to L.F.), the Wellcome Trust/Royal Society (204290/Z/16/Z to C.A.D.), Takeda, Ltd. (L.F. and C.A.D.), the Medical Research Council (grant number MC_UU_12010/3 to L.F.), and the Oak Foundation (OCAY-15-520 to L.F.). This work was supported by the Australian National Health and Medical Research Council (NHMRC), Career Development Fellowship 1053756 (S.L.), and by Victorian Life Sciences Computation Initiative (VLSCI) grant VR0240 on its Peak Computing Facility at the University of Melbourne, an initiative of the Victorian government in Australia (S.L.). Research at the Murdoch Children's Research Institute was supported by the Victorian government's Operational Infrastructure Support Program.
G.M. is a cofounder of, holder of shares in, and consultant to Genomics, PLC. P.D. is a cofounder of, holder of shares in, and director and executive officer of Genomics, PLC. A.D., G.M., P.D. and S.L. are partners in Peptide Groove, LLP. Peptide Groove has licensed HLA typing technology to Affymetrix, Ltd. The other authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 BFtree statistic and the number of non-zero nodes at a threshold of PP = 0.75 over the parameter space of θ and π1 for HLA-B*27:05 allele association with risk for clinical diagnoses in the HES data set.
Supplementary Figure 2 Comparison of rate of active node identification in TreeWAS and PheWAS analyses with simulated data.
Rate of active node identification at increasing posterior probability (PP) thresholds and different simulated allele frequencies of the causal genetic variant, for the TreeWAS method (θ = 1/3 and π1 = 0.001) and assuming a model with complete independence among phenotypes (θ → ∞ and π1 = 0.001), which is equivalent to PheWAS. We simulated data for 500 replicates where the genetic variant affects clinical annotations found distributed in the tree. The rate of active node identification was calculated for the five affected clinical annotations (circles) and for the rest of the annotations in the tree, which have zero genetic coefficients (diamonds).
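The rate described above can be illustrated with a minimal sketch. It assumes that each replicate yields one posterior probability per node and that a node counts as identified when its PP is at or above the threshold; the function name `identification_rate`, the use of `>=`, and the toy values are illustrative assumptions, not taken from the TreeWAS software.

```python
import numpy as np

def identification_rate(posterior_probs, thresholds):
    """Fraction of replicates in which each node is called active.

    posterior_probs: array of shape (n_replicates, n_nodes), each entry the
        posterior probability that the node has a non-zero genetic coefficient.
    thresholds: sequence of PP thresholds.
    Returns an array of shape (n_thresholds, n_nodes).
    """
    pp = np.asarray(posterior_probs, dtype=float)
    thr = np.asarray(thresholds, dtype=float)
    # Compare every replicate/node against every threshold, then
    # average over the replicate axis (assumption: PP >= threshold).
    called = pp[None, :, :] >= thr[:, None, None]
    return called.mean(axis=1)

# Toy values (not real results): 4 replicates, 2 nodes.
# Column 0 mimics an affected node, column 1 a node with zero coefficient.
pp = np.array([[0.90, 0.10],
               [0.80, 0.20],
               [0.60, 0.05],
               [0.95, 0.70]])
rates = identification_rate(pp, [0.5, 0.75])
print(rates)
```

For affected nodes the rate plays the role of sensitivity; for nodes with zero coefficients it is the false-positive rate at that threshold.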
Sensitivity analysis in the detection of genetic association at the tree level as measured by the BFtree statistic. We simulated data where the causal variant affected clustered nodes in the tree and fitted the TreeWAS method (blue) and the PheWAS models where we assume complete independence among phenotypes (orange) and where we assume complete independence among phenotypes and all nodes to be active (yellow).
Sensitivity analysis in the detection of genetic association at the tree level as measured by the BFtree statistic. We simulated data where the causal variant affected distributed nodes in the tree and fitted the TreeWAS method (blue) and the PheWAS models where we assume complete independence among phenotypes (orange) and where we assume complete independence among phenotypes and all nodes to be active (yellow).
Linkage disequilibrium between independent HLA associations found in the analyses for the SR and HES data sets. Each allele shown was found in at least one of the analyses, seven of which were found in both. With the exception of the HLA-DRB1*15:01 and HLA-DQB1*06:02 alleles, none of the identified associations were in linkage disequilibrium (r² < 0.02). The HLA-DRB1*15:01 and HLA-DQB1*06:02 alleles were identified in the SR and HES analyses, respectively; the two are in high linkage disequilibrium (ρ = 0.98) and were fine-mapped to the same phenotypes.
Supplementary Figure 6 Effects of T1D on T2D diagnosis misclassification in TreeWAS summary statistics.
Individuals with a type 1 diabetes (T1D) diagnosis were misclassified as having type 2 diabetes (T2D) at different misclassification rates, ranging from 0.5 to 10% of the cohort with a T2D diagnosis. TreeWAS analysis was performed with both T1D and T2D GRSs in the SR and HES data sets. 100 simulations were generated for each misclassification rate. (a,b) Evidence of association at the tree level (BFtree) in the SR (a) and HES (b) data sets for the T1D and T2D GRSs. (c,d) Distribution of estimated posterior probabilities for the T1D and T2D diagnosis terms in the SR (c) and HES (d) data sets for both T1D and T2D GRS analyses. (e,f) Distribution of estimated effect sizes for the T1D and T2D terms in the SR (e) and HES (f) data sets for both T1D and T2D GRS analyses.
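A minimal sketch of the relabeling step may help make the simulation design concrete. It assumes that the number of relabeled individuals equals the misclassification rate times the size of the T2D cohort, and that relabeled individuals lose their T1D label; both choices, the function name `misclassify_t1d_as_t2d`, and the toy cohort sizes are assumptions for illustration and are not confirmed by the caption.

```python
import numpy as np

def misclassify_t1d_as_t2d(t1d, t2d, rate, rng):
    """Relabel randomly chosen T1D cases as T2D.

    t1d, t2d: boolean arrays, one entry per individual.
    rate: misclassification rate, as a fraction of the T2D cohort size
        (assumption about how the rate is defined).
    rng: numpy random Generator, so each simulation is reproducible.
    Returns new (t1d, t2d) arrays; the inputs are not modified.
    """
    t1d = np.asarray(t1d, dtype=bool).copy()
    t2d = np.asarray(t2d, dtype=bool).copy()
    n_move = int(round(rate * t2d.sum()))
    # Only individuals with T1D and without T2D can be relabeled.
    candidates = np.flatnonzero(t1d & ~t2d)
    n_move = min(n_move, candidates.size)
    moved = rng.choice(candidates, size=n_move, replace=False)
    t1d[moved] = False  # assumption: the T1D label is dropped
    t2d[moved] = True
    return t1d, t2d

# Toy cohort (not real data): 1,000 people, 50 with T1D, 200 with T2D.
t1d = np.zeros(1000, dtype=bool)
t2d = np.zeros(1000, dtype=bool)
t1d[:50] = True
t2d[50:250] = True

rng = np.random.default_rng(0)
new_t1d, new_t2d = misclassify_t1d_as_t2d(t1d, t2d, rate=0.10, rng=rng)
print(new_t1d.sum(), new_t2d.sum())
```

Repeating the call with a fresh seed for each of the 100 simulations per rate, and rerunning the association analysis on the relabeled phenotypes, would give distributions comparable in spirit to those in the panels.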
Supplementary Figure 7 Effects of T2D on T1D diagnosis misclassification in TreeWAS summary statistics.
Individuals with a type 2 diabetes (T2D) diagnosis were misclassified as having type 1 diabetes (T1D) at different misclassification rates, ranging from 0.5 to 10% of the cohort with a T1D diagnosis. TreeWAS analysis was performed with both T1D and T2D GRSs in the SR and HES data sets. 100 simulations were generated for each misclassification rate. (a,b) Evidence of association at the tree level (BFtree) in the SR (a) and HES (b) data sets for the T1D and T2D GRSs. (c,d) Distribution of estimated posterior probabilities for the T1D and T2D diagnosis terms in the SR (c) and HES (d) data sets for both T1D and T2D GRS analyses. (e,f) Distribution of estimated effect sizes for the T1D and T2D terms in the SR (e) and HES (f) data sets for both T1D and T2D GRS analyses.
Supplementary Figure 8 Ancestry analysis of UK Biobank individuals using principal-component analysis.
120,286 individuals plotted in orange were retained in the analysis, and these co-cluster with European-ancestry populations.
β1 and β2 are the log-odds coefficients for heterozygotes and homozygotes, respectively. The heat map indicates the relative density of the prior.
About this article
Cite this article
Cortes, A., Dendrou, C., Motyer, A. et al. Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. Nat Genet 49, 1311–1318 (2017). https://doi.org/10.1038/ng.3926