Using electronic health records to drive discovery in disease genomics

Key Points

  • The recently projected sample-size requirements for genomic studies of common variants of modest effect size, and of rare variants of larger effect size, are rapidly outpacing the capabilities and budgets of most investigators and organizations.

  • The ongoing substantial investment in electronic health records (EHRs) for clinical care can be leveraged to accelerate population-scale genomic research, and to reduce its costs, each by at least an order of magnitude.

  • EHR-driven genomic research (EDGR) uses both the codified data available in the EHR and, by means of natural language processing (NLP), the phenotypic characterizations buried in the narrative text of the record.

  • Among the advantages of EDGR versus conventional cohort studies are timely clinical relevance and cost-effective scalability.

  • Biobanking and existing cohort studies will increasingly use EHR-derived data to augment the phenotypic characterizations obtained.

  • EDGR studies have already shown they can reproduce conventionally run genome-wide association (GWA) studies and extend the findings of those prior GWA studies to additional, often underrepresented populations.

  • EDGR can also enable studies that would be difficult to conduct otherwise, such as calculating the effect size of a genetic variant not for one disease or trait but for all diseases and traits captured in the EHR (a so-called phenome-wide association study).

  • A thorny challenge is achieving broad, international adoption of a standardized consent model or regulatory framework. If unaddressed, this challenge may impede further rapid adoption of EDGR. The patchy implementation of EHRs and their very large costs will also slow adoption of EDGR techniques.
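
The phenome-wide association idea in the Key Points can be illustrated with a minimal sketch. It assumes a toy in-memory data set (a carrier flag per patient and a set of phenotype codes per patient), uses invented phenotype labels, and estimates an odds ratio per phenotype from a 2x2 table with a 0.5 continuity correction. The function name, data layout and choice of statistic are assumptions made for illustration and are not the method of any study cited here.

```python
import math

def phewas_odds_ratios(carriers, phenotypes_by_patient):
    """Estimate, for one variant, an odds ratio for every phenotype code.

    carriers: dict of patient_id -> bool (True if the patient carries the variant)
    phenotypes_by_patient: dict of patient_id -> set of phenotype codes
    Returns dict of code -> (odds_ratio, ci_low, ci_high), using a 95% Wald interval.
    """
    all_codes = set().union(*phenotypes_by_patient.values())
    results = {}
    for code in sorted(all_codes):
        a = b = c = d = 0  # cells of the 2x2 table for this phenotype
        for pid, is_carrier in carriers.items():
            has_code = code in phenotypes_by_patient.get(pid, set())
            if is_carrier and has_code:
                a += 1  # carrier with phenotype
            elif is_carrier:
                b += 1  # carrier without phenotype
            elif has_code:
                c += 1  # non-carrier with phenotype
            else:
                d += 1  # non-carrier without phenotype
        # Continuity correction (add 0.5 to each cell) avoids division by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        results[code] = (
            math.exp(log_or),
            math.exp(log_or - 1.96 * se),
            math.exp(log_or + 1.96 * se),
        )
    return results


# Toy cohort: four carriers, four non-carriers, two illustrative phenotype labels.
toy_carriers = {f"p{i}": i <= 4 for i in range(1, 9)}
toy_phenotypes = {
    "p1": {"RA"}, "p2": {"RA"}, "p3": {"RA"}, "p4": {"T2D"},
    "p5": {"RA"}, "p6": {"T2D"}, "p7": set(), "p8": set(),
}
toy_results = phewas_odds_ratios(toy_carriers, toy_phenotypes)
```

Because one test is run per phenotype, a real analysis would also need to correct for multiple testing and adjust for covariates such as ancestry; the sketch omits both.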


If genomic studies are to be a clinically relevant and timely reflection of the relationship between genetics and health status, whether for common or rare variants, cost-effective ways must be found to measure both the genetic variation and the phenotypic characteristics of large populations, including a comprehensive and up-to-date record of their medical treatment. Electronic health records, which clinicians use to document clinical care, are becoming widespread, and recent studies demonstrate that they can be used effectively for genetic studies that draw on the informational and biological 'by-products' of health-care delivery while maintaining patient privacy.


Figure 1: From clinical notes to structured phenotypes.
Figure 2: Two archetypal workflows in electronic health record-driven genomic research.




Acknowledgements

The following were kind enough to share insights into their respective EDGR-related efforts: S. Brunak, J. Kim, J. Terdiman, L. Walter, J. Starren, J. Vilo, D. Masys, D. Roden, N. Stimson, L. Bry and S. Churchill. Any errors in communicating these insights are the sole responsibility of the author. The author was supported in part by US National Institutes of Health funding for the US National Centers for Biomedical Computing, U54 LM008748.

Author information

Ethics declarations

Competing interests

The author declares no competing financial interests.

Related links


Isaac S. Kohane's homepage

Biobank Japan Project

Danish National Biobank

Database of Genotypes and Phenotypes (dbGaP)

eMERGE Network

Estonian Genome Center

DNA sequencing costs


Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH)

Marshfield Clinic Personalized Medicine Research Project (PMRP)

UK Biobank

Vanderbilt BioVU



Glossary

Biorepository

A biological materials repository that collects, processes, stores and distributes biospecimens to support future scientific investigation.

Natural language processing

(NLP). A field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. NLP techniques allow the text in electronic medical records to be transformed from a clinical narrative to a set of codified terms or tags that are more readily subject to computational and statistical analysis.
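
As a rough illustration of turning narrative text into codified tags, the sketch below matches phrases from a tiny hand-made vocabulary and marks a mention as negated when a negation cue precedes it in the same sentence. The vocabulary, the code labels (DX_RA and so on) and the cue list are invented for this example; production clinical NLP relies on far larger controlled vocabularies and more careful handling of negation, uncertainty and context.

```python
import re

# Hypothetical mini-vocabulary: surface phrase -> illustrative tag (not real codes).
VOCAB = {
    "rheumatoid arthritis": "DX_RA",
    "myocardial infarction": "DX_MI",
    "smoker": "HX_SMOKING",
}

# Very small set of negation cues, matched as whole words before the phrase.
NEGATION_PATTERN = re.compile(r"\b(no|denies|without|negative for)\b")

def tag_note(note):
    """Return a list of (tag, status) pairs found in a free-text clinical note.

    status is "negated" if a negation cue appears earlier in the same sentence,
    otherwise "affirmed".
    """
    tags = []
    # Split into rough sentences on full stops, semicolons and newlines.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for phrase, tag in VOCAB.items():
            idx = sentence.find(phrase)
            if idx == -1:
                continue
            negated = NEGATION_PATTERN.search(sentence[:idx]) is not None
            tags.append((tag, "negated" if negated else "affirmed"))
    return tags


example_note = (
    "Patient has rheumatoid arthritis. "
    "Denies myocardial infarction. Former smoker."
)
example_tags = tag_note(example_note)
```

The resulting tag list can be analysed alongside the billing and laboratory codes already present in the record.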


Biobank

A cryogenic storage facility used to archive biological samples for use in research and experiments. Ranging in size from individual refrigerators to warehouses, biobanks are maintained by institutions such as hospitals, universities, non-profit organizations, pharmaceutical companies and national biorepositories. More recently, the term biobank has been used to signify a population cohort study with stored biological samples.

Population stratification

The presence of a systematic difference in allele frequencies between subpopulations from a larger population, possibly owing to different ancestry, especially in the context of association studies. (Population stratification is also referred to as population structure in this context.) If not properly accounted for in association studies, population stratification can lead to spurious associations.
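
A small worked example shows how such a spurious association can arise. The counts below are invented: two subpopulations differ in both carrier frequency and disease prevalence, and within each the variant is unrelated to disease (odds ratio of 1). Pooling them without accounting for ancestry produces an apparent association, whereas a stratified (Mantel-Haenszel) estimate does not. The numbers and names are assumptions chosen only to make the arithmetic easy to follow.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: carriers with disease      b: carriers without disease
    c: non-carriers with disease  d: non-carriers without disease
    """
    return (a * d) / (b * c)

# Invented counts (a, b, c, d) for two subpopulations of 1,000 people each.
# A: carrier frequency 0.8, prevalence 0.20; B: carrier frequency 0.2, prevalence 0.05.
strata = {
    "A": (160, 640, 40, 160),
    "B": (10, 190, 40, 760),
}

# Within each subpopulation the variant and the disease are independent.
per_stratum = {name: odds_ratio(*cells) for name, cells in strata.items()}

# Naive pooling simply adds the tables cell by cell.
pooled_cells = tuple(sum(col) for col in zip(*strata.values()))
pooled = odds_ratio(*pooled_cells)

# Mantel-Haenszel estimate: combine the strata without mixing them.
numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mantel_haenszel = numerator / denominator
```

Here both per-stratum odds ratios are 1, the pooled odds ratio is (170 x 920) / (830 x 80), or about 2.36, and the stratified estimate returns to 1.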

Controlled vocabularies

A controlled vocabulary includes only terms that have been selected by the group that created the vocabulary. The goal of such a vocabulary is to standardize and simplify the organization of data and knowledge in a particular domain.


Phenome

The set of all phenotypes expressed by a cell, tissue, organ, organism or species.


Datamart

The entire stored data of an enterprise (for example, a health-care centre) is often termed the data warehouse. For a specified purpose (for example, a disease-specific study), a subset of the data warehouse, called the datamart, is extracted for a group of analysts.


Metagenomics

The study of metagenomes, which consist of genetic material recovered directly from environmental samples. Increasingly, it is used to describe the shotgun sequencing and analysis of the microbial genomes found in the milieu of the human body and its waste products.

About this article

Cite this article

Kohane, I. Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12, 417–428 (2011).
