
  • Analysis

Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank

Abstract

Genetic discovery from the multitude of phenotypes extractable from routine healthcare data can transform understanding of the human phenome and accelerate progress toward precision medicine. However, a critical question when analyzing high-dimensional and heterogeneous data is how best to interrogate increasingly specific subphenotypes while retaining statistical power to detect genetic associations. Here we develop and employ a new Bayesian analysis framework that exploits the hierarchical structure of diagnosis classifications to analyze genetic variants against UK Biobank disease phenotypes derived from self-reporting and hospital episode statistics. Our method displays a more than 20% increase in power to detect genetic effects over other approaches and identifies new associations between classical human leukocyte antigen (HLA) alleles and common immune-mediated diseases (IMDs). By applying the approach to genetic risk scores (GRSs), we show the extent of genetic sharing among IMDs and expose differences in disease perception or diagnosis with potential clinical implications.


Figure 1: Schematic of a diagnosis classification tree and genetic coefficient transition scenarios tested.
Figure 2: Evidence of HLA-B*27:05 allele association with risk for clinical diagnoses in the HES data set.
Figure 3: Sensitivity and specificity analysis of TreeWAS on simulated data.
Figure 4: Genetic analysis of HLA allelic variation in the risk of clinical phenotypes from the UK Biobank SR diagnosis and HES data sets.
Figure 5: Association analysis of genetic risk for multiple IMDs derived from clinical phenotypes in the UK Biobank SR diagnosis and HES data sets.

Acknowledgements

This research has been conducted using the UK Biobank Resource (application number 10625). The research has been supported by the Wellcome Trust (095552/Z/11/Z to P.D., 100308/Z/12/Z to L.F., 100956/Z/13/Z to G.M., and 090532/Z/09/Z and 203141/Z/16/Z to the Wellcome Trust Centre for Human Genetics), the Danish National Research Foundation (grant number 126 to L.F.), the Wellcome Trust/Royal Society (204290/Z/16/Z to C.A.D.), Takeda, Ltd. (L.F. and C.A.D.), the Medical Research Council (grant number MC_UU_12010/3 to L.F.), and the Oak Foundation (OCAY-15-520 to L.F.). This work was supported by the Australian National Health and Medical Research Council (NHMRC), Career Development Fellowship 1053756 (S.L.), and by Victorian Life Sciences Computation Initiative (VLSCI) grant VR0240 on its Peak Computing Facility at the University of Melbourne, an initiative of the Victorian government in Australia (S.L.). Research at the Murdoch Children's Research Institute was supported by the Victorian government's Operational Infrastructure Support Program.

Author information

Contributions

A.C. and G.M. performed the analyses with contributions from C.A.D. A.C., C.A.D., L.J., P.D., L.F. and G.M. conceived the study. A.M., D.V., A.D. and S.L. performed HLA imputation. A.C., C.A.D. and G.M. wrote the manuscript, and all other authors reviewed the manuscript.

Corresponding author

Correspondence to Gil McVean.

Ethics declarations

Competing interests

G.M. is a cofounder of, holder of shares in, and consultant to Genomics, PLC. P.D. is a cofounder of, holder of shares in, and director and executive officer of Genomics, PLC. A.D., G.M., P.D. and S.L. are partners in Peptide Groove, LLP. Peptide Groove has licensed HLA typing technology to Affymetrix, Ltd. The other authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 BFtree statistic and the number of non-zero nodes at a threshold of PP = 0.75 over the parameter space of θ and π1 for HLA-B*27:05 allele association with risk for clinical diagnoses in the HES data set.

Supplementary Figure 2 Comparison of rate of active node identification in TreeWAS and PheWAS analyses with simulated data.

Rate of active node identification at increasing posterior probability (PP) thresholds and different simulated allele frequencies of the causal genetic variant, for the TreeWAS method (θ = 1/3 and π1 = 0.001) and assuming a model with complete independence among phenotypes (θ → ∞ and π1 = 0.001), which is equivalent to PheWAS. We simulated data for 500 replicates where the genetic variant affects clinical annotations found distributed in the tree. The rate of active node identification was calculated for the five affected clinical annotations (circles) and for the rest of the annotations in the tree, which have zero genetic coefficients (diamonds).

Supplementary Figure 3 Sensitivity analysis for clustered active nodes.

Sensitivity analysis in the detection of genetic association at the tree level as measured by the BFtree statistic. We simulated data where the causal variant affected clustered nodes in the tree and fitted the TreeWAS method (blue) and the PheWAS models where we assume complete independence among phenotypes (orange) and where we assume complete independence among phenotypes and all nodes to be active (yellow).

Supplementary Figure 4 Sensitivity analysis for distributed active nodes.

Sensitivity analysis in the detection of genetic association at the tree level as measured by the BFtree statistic. We simulated data where the causal variant affected distributed nodes in the tree and fitted the TreeWAS method (blue) and the PheWAS models where we assume complete independence among phenotypes (orange) and where we assume complete independence among phenotypes and all nodes to be active (yellow).

Supplementary Figure 5 Linkage disequilibrium between identified HLA associations.

Linkage disequilibrium between independent HLA associations found in the analyses for the SR and HES data sets. Each allele shown was found in at least one of the analyses, and seven were found in both. With the exception of the HLA-DRB1*15:01 and HLA-DQB1*06:02 alleles, none of the identified associations were in linkage disequilibrium (r² < 0.02). The HLA-DRB1*15:01 and HLA-DQB1*06:02 alleles were identified in the SR and HES analyses, respectively; they are in high linkage disequilibrium (ρ = 0.98) and were fine-mapped to the same phenotypes.
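
As an illustration of the r² statistic quoted in this caption, the following is a minimal sketch of the standard squared-correlation measure of linkage disequilibrium between two alleles, each coded as present (1) or absent (0) on a set of phased haplotypes. The caption does not say how the values were computed, so the function name `ld_r2` and the haplotype-level input format are assumptions made for illustration only.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two alleles across haplotypes.

    hap_a, hap_b: equal-length sequences of 0/1 flags, one entry per
    haplotype, marking whether the haplotype carries each allele.
    (Hypothetical input format; not taken from the article.)
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("inputs must be non-empty and of equal length")

    p_a = sum(hap_a) / n                      # frequency of allele A
    p_b = sum(hap_b) / n                      # frequency of allele B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when an allele is monomorphic")
    return d * d / denom
```

Two alleles that always occur together give a value of 1, and two alleles that co-occur exactly as often as expected by chance give 0.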

Supplementary Figure 6 Effects of T1D on T2D diagnosis misclassification in TreeWAS summary statistics.

Individuals with a type 1 diabetes (T1D) diagnosis were misclassified as having type 2 diabetes (T2D) at different misclassification rates, ranging from 0.5 to 10% of the cohort with a T2D diagnosis. TreeWAS analysis was performed with both T1D and T2D GRSs in the SR and HES data sets. 100 simulations were generated for each misclassification rate. (a,b) Evidence of association at the tree level (BFtree) in the SR (a) and HES (b) data sets for the T1D and T2D GRSs. (c,d) Distribution of estimated posterior probabilities for the T1D and T2D diagnosis terms in the SR (c) and HES (d) data sets for both T1D and T2D GRS analyses. (e,f) Distribution of estimated effect sizes for the T1D and T2D terms in the SR (e) and HES (f) data sets for both T1D and T2D GRS analyses.
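
The relabeling step described in this caption can be sketched as follows. This is only a hedged illustration: the caption states that the misclassification rate is expressed relative to the size of the T2D cohort, but it does not say whether relabeled individuals were also removed from the T1D group, how they were sampled, or how identifiers were stored. The function name `misclassify`, the choice to move individuals out of the T1D list, and uniform random sampling are all assumptions.

```python
import random


def misclassify(t1d_ids, t2d_ids, rate, seed=0):
    """Relabel a random subset of T1D cases as T2D cases.

    rate is a fraction of the T2D cohort size (e.g. 0.005 to 0.10),
    following the caption. Moving the sampled individuals out of the
    T1D list is an assumption of this sketch.
    """
    n_move = round(rate * len(t2d_ids))
    if n_move > len(t1d_ids):
        raise ValueError("not enough T1D cases to reach the requested rate")

    rng = random.Random(seed)                 # seeded for reproducibility
    moved = set(rng.sample(sorted(t1d_ids), n_move))

    new_t1d = [i for i in t1d_ids if i not in moved]
    new_t2d = list(t2d_ids) + sorted(moved)
    return new_t1d, new_t2d
```

Repeating the call with different seeds for each rate would give a set of simulated data sets analogous to the 100 simulations per rate mentioned in the caption.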

Supplementary Figure 7 Effects of T2D on T1D diagnosis misclassification in TreeWAS summary statistics.

Individuals with a type 2 diabetes (T2D) diagnosis were misclassified as having type 1 diabetes (T1D) at different misclassification rates, ranging from 0.5 to 10% of the cohort with a T1D diagnosis. TreeWAS analysis was performed with both T1D and T2D GRSs in the SR and HES data sets. 100 simulations were generated for each misclassification rate. (a,b) Evidence of association at the tree level (BFtree) in the SR (a) and HES (b) data sets for the T1D and T2D GRSs. (c,d) Distribution of estimated posterior probabilities for the T1D and T2D diagnosis terms in the SR (c) and HES (d) data sets for both T1D and T2D GRS analyses. (e,f) Distribution of estimated effect sizes for the T1D and T2D terms in the SR (e) and HES (f) data sets for both T1D and T2D GRS analyses.

Supplementary Figure 8 Ancestry analysis of UK Biobank individuals using principal-component analysis.

120,286 individuals plotted in orange were retained in the analysis, and these co-cluster with European-ancestry populations.

Supplementary Figure 9 Prior on effect sizes for the full genetic model.

β1 and β2 are the log-odds coefficients for heterozygotes and homozygotes, respectively. The heat map indicates the relative density of the prior.
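
One common way to write a genotype model with separate log-odds coefficients for heterozygotes and homozygotes is shown below. This is a sketch of the usual logistic parameterization, not the article's stated model: the intercept β0, the genotype variable g (number of copies of the allele) and the outcome y are notation assumed here for illustration.

```latex
\operatorname{logit} \Pr(y = 1 \mid g)
  = \beta_0 + \beta_1 \,\mathbf{1}[g = 1] + \beta_2 \,\mathbf{1}[g = 2],
  \qquad g \in \{0, 1, 2\}
```

Under this parameterization, a purely additive effect on the log-odds scale corresponds to the special case β2 = 2β1.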

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–9 and Supplementary Note (PDF 2359 kb)

Supplementary Tables 1–12

Supplementary Tables 1–12 (XLSX 1191 kb)

About this article

Cite this article

Cortes, A., Dendrou, C., Motyer, A. et al. Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. Nat Genet 49, 1311–1318 (2017). https://doi.org/10.1038/ng.3926

  • DOI: https://doi.org/10.1038/ng.3926
