Repurposing large health insurance claims data to estimate genetic and environmental contributions in 560 phenotypes

Lakhani, Chirag M.; Tierney, Braden T.; Manrai, Arjun K.; Yang, Jian; Visscher, Peter M.; Patel, Chirag J.

doi:10.1038/s41588-018-0313-7

Analysis
Published: 14 January 2019

Nature Genetics volume 51, pages 327–334 (2019)

An Author Correction to this article was published on 27 February 2019

This article has been updated

Abstract

We analysed a large health insurance dataset to assess the genetic and environmental contributions to 560 disease-related phenotypes in 56,396 twin pairs and 724,513 sibling pairs out of 44,859,462 individuals living in the United States. We also estimated the contribution of environmental risk factors (socioeconomic status (SES), air pollution and climate) to each phenotype. Mean heritability (h² = 0.311) and shared environmental variance (c² = 0.088) were higher than variance attributed to specific environmental factors such as zip-code-level SES (var_SES = 0.002), daily air quality (var_AQI = 0.0004), and average temperature (var_temp = 0.001) overall, as well as for individual phenotypes. We found significant heritability and shared environment for a number of comorbidities (h² = 0.433, c² = 0.241) and average monthly cost (h² = 0.290, c² = 0.302). All results are available using our Claims Analysis of Twin Correlation and Heritability (CaTCH) web application.
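As a reading aid for the h² and c² values quoted above, the following minimal sketch shows the textbook relationship between twin correlations and variance components under the classical ACE model (Falconer's formulas). It is an illustration only and is not necessarily the estimator used in this study; the function names `falconer_ace` and `implied_twin_correlations` are hypothetical, and the sketch assumes that zygosity is known and that the equal-environments assumption holds.

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Textbook ACE decomposition from twin correlations (Falconer's formulas).

    r_mz: phenotypic correlation in monozygotic twin pairs
    r_dz: phenotypic correlation in dizygotic twin pairs
    Assumes additive genetic effects and equal shared environments.
    """
    h2 = 2.0 * (r_mz - r_dz)   # additive genetic variance share
    c2 = 2.0 * r_dz - r_mz     # shared environmental variance share
    e2 = 1.0 - r_mz            # unique environment plus measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


def implied_twin_correlations(h2: float, c2: float) -> dict:
    """Invert the formulas: twin correlations implied by given h2 and c2."""
    return {"r_mz": h2 + c2, "r_dz": 0.5 * h2 + c2}


if __name__ == "__main__":
    # Mean values quoted in the abstract, used purely as illustrative inputs;
    # averages across phenotypes do not describe any single phenotype.
    implied = implied_twin_correlations(h2=0.311, c2=0.088)
    print(implied)
    # Round trip: the implied correlations recover the same components.
    print(falconer_ace(implied["r_mz"], implied["r_dz"]))
```

Under these assumptions, the mean estimates in the abstract would correspond to a monozygotic twin correlation of roughly 0.40 and a dizygotic twin correlation of roughly 0.24.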

**Fig. 1: Geographic distribution of 56,396 twin pairs in CaTCH and an example of environmental data aggregation on a zip code basis.**

**Fig. 2: Estimates of twin statistics across functional domains and individual basis for 56,396 twin pairs in CaTCH among all 560 phenotypes.**

**Fig. 3: Comparison of h² estimates in CaTCH to published literature and estimates for cost and comorbidities in CaTCH.**

**Fig. 4: Comparison of h²/c² estimates from 56,396 twin pairs among 560 phenotypes in CaTCH to 5,169,880 twin pairs among 9,568 phenotypes in MaTCH (Supplementary Table 1).**

Data Availability

The data that support the findings of this study are available from Aetna Insurance, but restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available. Please contact N. Palmer (nathan_palmer@hms.harvard.edu) for inquiries about the Aetna dataset. Summary data are, however, available from the authors upon reasonable request and with permission of Aetna Insurance. Code for analysis, generation of figures and figure files is available at https://github.com/cmlakhan/twinInsurance.

Change history

27 February 2019
In the version of this article initially published, in Fig. 4b, the shared environmental variance (c²) values for all MaTCH functional domains except ‘all traits’ were erroneously estimated because of a coding error. Figure 4 has been revised to include corrected c² estimates in the data in panel b as well as the number of phenotypes in CaTCH and MaTCH functional domains in the y axes of panels a and b; the Fig. 4 legend and the description of Fig. 4b in the Results section have also been revised to describe these changes. In addition, the erroneous term ‘depravity index’, appearing throughout the article’s main text, Fig. 1, Supplementary Fig. 10 and the Supplementary Note, should have read ‘deprivation index’. The errors have been corrected in the HTML and PDF versions of the article. Images of the original figure are shown in the correction notice.

Acknowledgements

We thank K. Fox of Aetna, Inc., N. Palmer of Harvard Medical School, and I. Kohane of Harvard Medical School for support and providing access to the Aetna Insurance Claims Data. We are grateful to L. O’Connor and A. Price for helpful discussion. This research was supported by the Australian National Health and Medical Research Council (1078037 and 1113400), National Institutes of Health NIEHS (R00ES23504 and R21ES205052), the National Science Foundation (1636870), and the Sylvia & Charles Viertel Charitable Foundation.

Author information

These authors jointly supervised this work: Peter M. Visscher, Chirag J. Patel.

Authors and Affiliations

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Chirag M. Lakhani, Braden T. Tierney, Arjun K. Manrai & Chirag J. Patel
Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA
Braden T. Tierney
Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA, USA
Arjun K. Manrai
Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
Jian Yang & Peter M. Visscher
Queensland Brain Institute, The University of Queensland, Brisbane, Australia
Jian Yang & Peter M. Visscher

Contributions

All authors contributed extensively to the work presented in this paper. C.M.L., P.M.V., and C.J.P. designed experiments, analysed data, and wrote the manuscript. B.T.T. developed the Shiny App for analysis. B.T.T., A.K.M., and J.Y. contributed to iterative improvement of the manuscript.

Corresponding authors

Correspondence to Peter M. Visscher or Chirag J. Patel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11, Supplementary Notes 1–6 and Supplementary Tables 1–4 and 6

Reporting Summary

Supplementary Table 5

Comparison of h² estimates from claims analysis to h² estimates from 81 published studies, including the method of estimation

About this article

Cite this article

Lakhani, C.M., Tierney, B.T., Manrai, A.K. et al. Repurposing large health insurance claims data to estimate genetic and environmental contributions in 560 phenotypes. Nat Genet 51, 327–334 (2019). https://doi.org/10.1038/s41588-018-0313-7

Received: 16 April 2018
Accepted: 07 November 2018
Published: 14 January 2019
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41588-018-0313-7
