Classification of common human diseases derived from shared genetic and environmental determinants


In this study, we used insurance claims for over one-third of the entire US population to create a subset of 128,989 families (481,657 unique individuals). We then used these data to (i) estimate the heritability and familial environmental patterns of 149 diseases and (ii) infer the genetic and environmental correlations for disease pairs from a set of 29 complex diseases. The majority (52 of 65) of our study's heritability estimates matched earlier reports, and 84 of our estimates appear to have been obtained for the first time. We used correlation matrices to compute environmental and genetic disease classifications and corresponding reliability measures. Among unexpected observations, we found that migraine, typically classified as a disease of the central nervous system, appeared to be most genetically similar to irritable bowel syndrome and most environmentally similar to cystitis and urethritis, all of which are inflammatory diseases.


Figure 1: Information on study population, results of model selection, and analysis of heritability of 149 diseases.
Figure 2: Genetic and environmental correlations between diseases.




We thank E. Gannon, R. Melamed, R. Mork, and M. Rzhetsky for numerous comments on earlier versions of the manuscript. This work was funded by the DARPA Big Mechanism program under ARO contract W911NF1410333, by National Institutes of Health grants R01HL122712, 1P50MH094267, and U01HL108634-01, and by a gift from Liz and Kent Dauten.

Author information




All authors contributed extensively to the work presented in this paper. K.W. and A.R. designed experiments, analyzed data, and wrote the manuscript; K.W., H.G., and H.P. performed computational experiments; and N.J.C., H.G., and H.P. contributed to iterative improvement of the manuscript.

Corresponding author

Correspondence to Andrey Rzhetsky.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Environmental effects estimates.

(a) Common couple environment effects. (b) Common sibling environment effects. (c) Unique environment effects. Bar color in the bar plots indicates biological systems associated with each disease, consistent throughout all figures.

Supplementary Figure 2 Testing dependence of heritability estimates on age of onset; heritability distributions, sorted by biological system.

(a) Histograms and density plots of heritability estimates by biological system. (b) Heritability estimate versus disease age of onset for biological systems with more than three diseases, with linear fits indicated by solid lines.

Supplementary Figure 3 Positive correlations between phenotypic and genetic correlations and between phenotypic and environmental correlations.

Supplementary Figure 4 Classification trees: ICD-9 versus phenotypic correlations.

(a) A classification of diseases that corresponds to a subset of ICD-9 taxonomy. (b) Disease classification constructed from phenotypic correlations between diseases; distances between diseases were calculated as 1 – correlation.

Supplementary Figure 5 Neighbor-joining classifications of the 29 conditions, inferred from genetic correlations (left tree) and environmental correlations (right tree).

For both classifications, we defined the distance between diseases as 1 – correlation. Because we estimated a posterior distribution for each correlation estimate, we were able to sample 10,000 distance sets using posterior distributions for pairwise correlations. For each of these samples, we estimated a classification and computed reliability measures for individual classification topology partitions (each integer number on the tree indicates the percentage of trees out of 10,000 in which this particular partition was present). Disease labels are colored according to associated biological systems, consistent with other figures. Note that, while the genetic and environmental trees are significantly different, both are stable, as the bootstrap-like numbers indicate.
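The resampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each pairwise correlation's posterior can be approximated by a normal distribution with a given mean and s.d. (the study sampled from the estimated posteriors themselves), and it uses a plain neighbor-joining implementation. The function names, the four disease labels, and the toy correlation values are hypothetical.

```python
import random
from collections import Counter


def neighbor_joining_splits(names, dist):
    """Build a neighbor-joining tree and return its non-trivial partitions.

    Each partition is stored as the frozenset of labels on the side of the
    split that does not contain names[0], so equal splits compare equal.
    """
    n = len(names)
    all_leaves = frozenset(names)
    clusters = {i: frozenset([name]) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(n) if i != j}
    next_id = n
    splits = set()
    while len(clusters) > 3:
        ids = list(clusters)
        m = len(ids)
        row_sum = {i: sum(d[i, k] for k in ids if k != i) for i in ids}
        # Join the pair that minimizes the neighbor-joining Q criterion.
        _, i, j = min(
            ((m - 2) * d[i, j] - row_sum[i] - row_sum[j], i, j)
            for a, i in enumerate(ids)
            for j in ids[a + 1:]
        )
        # Distances from the new internal node to every remaining node.
        for k in ids:
            if k != i and k != j:
                d[next_id, k] = d[k, next_id] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        merged = clusters.pop(i) | clusters.pop(j)
        clusters[next_id] = merged
        next_id += 1
        splits.add(merged if names[0] not in merged else all_leaves - merged)
    return splits


def partition_support(names, mean, sd, n_samples=1000, seed=0):
    """Percentage of sampled trees in which each partition appears."""
    rng = random.Random(seed)
    n = len(names)
    counts = Counter()
    for _ in range(n_samples):
        dist = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                # ASSUMPTION: normal approximation to the posterior, clipped to [-1, 1].
                r = max(-1.0, min(1.0, rng.gauss(mean[i][j], sd[i][j])))
                dist[i][j] = dist[j][i] = 1.0 - r  # distance = 1 - correlation
        counts.update(neighbor_joining_splits(names, dist))
    return {split: 100.0 * c / n_samples for split, c in counts.items()}


# Toy input (made-up values): A and B are strongly correlated, as are C and D.
names = ["A", "B", "C", "D"]
mean = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
sd = [[0.05] * 4 for _ in range(4)]
support = partition_support(names, mean, sd)
for split, pct in support.items():
    print(sorted(split), round(pct))
```

In this sketch, a partition that appears in nearly all sampled trees plays the role of the integer labels on the tree branches, while a partition with low support indicates a grouping that is sensitive to uncertainty in the correlation estimates.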

Supplementary Figure 6 Estimates of age-related increase in disease liability for seven late-onset conditions (aneurysm, atherosclerosis, benign colon neoplasm, cataract, cerebrovascular disease, keratosis, and osteoarthritis).

Error bars show 1 s.d., and LOcally WEighted Scatter-plot Smoother (LOWESS) curve fits are shown with solid lines.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6 and Supplementary Tables 3 and 5–8 (PDF 1883 kb)

Supplementary Table 1

Acronyms, biological systems, prevalence percentages and standard errors for 149 studied diseases. (XLSX 74 kb)

Supplementary Table 2

Heritability and preventability estimates and standard deviations for 149 studied diseases. (XLSX 83 kb)

Supplementary Table 4

Pairwise estimates and standard deviations of genetic, environmental and phenotypic correlations for 29 diseases. (XLSX 64 kb)


Cite this article

Wang, K., Gaitsch, H., Poon, H. et al. Classification of common human diseases derived from shared genetic and environmental determinants. Nat Genet 49, 1319–1325 (2017).
