  • Review Article
Using electronic health records to drive discovery in disease genomics

Key Points

  • The recently projected sample-size requirements for genomic studies for common variants of modest effect size, and rare variants of larger effect size, are rapidly outpacing the capabilities and budgets of most investigators and organizations.

  • The ongoing substantial investment in electronic health records (EHRs) for clinical care can be leveraged to accelerate population-scale genomic research, and to reduce its costs, by at least an order of magnitude each.

  • EHR-driven genomic research (EDGR) uses both the codified data available in the EHR and the phenotypic characterizations buried in the narrative text of the record by means of natural language processing (NLP).

  • Among the advantages of EDGR versus conventional cohort studies are timely clinical relevance and cost-effective scalability.

  • Biobanking and existing cohort studies will increasingly use EHR-derived data to augment the phenotypic characterizations obtained.

  • EDGR studies have already shown they can reproduce conventionally run genome-wide association (GWA) studies and extend the findings of those prior GWA studies to additional, often underrepresented populations.

  • EDGR can also enable studies that would be difficult to conduct otherwise, such as calculating the effect size of a genetic variant not for one disease or trait but for all diseases and traits captured in the EHR (a so-called phenome-wide association study).

  • A thorny challenge is achieving broad, international adoption of a standardized consent model or regulatory framework. If unaddressed, this challenge may impede further rapid adoption of EDGR. The patchy implementation of EHRs and their very large costs will also slow adoption of EDGR techniques.

Abstract

If genomic studies are to be a clinically relevant and timely reflection of the relationship between genetics and health status, whether for common or rare variants, cost-effective ways must be found to measure both the genetic variation and the phenotypic characteristics of large populations, including the comprehensive and up-to-date record of their medical treatment. The adoption of electronic health records, used by clinicians to document clinical care, is becoming widespread, and recent studies demonstrate that these records can be effectively employed for genetic studies using the informational and biological 'by-products' of health-care delivery while maintaining patient privacy.

Figure 1: From clinical notes to structured phenotypes.
Figure 2: Two archetypal workflows in electronic health record-driven genomic research.

Acknowledgements

The following were kind enough to share insights into their respective EDGR-related efforts: S. Brunak, J. Kim, J. Terdiman, L. Walter, J. Starren, J. Vilo, D. Masys, D. Roden, N. Stimson, L. Bry and S. Churchill. Any errors in communicating these insights are the sole responsibility of the author. The author was supported in part by US National Institutes of Health funding for the US National Centers for Biomedical Computing, U54 LM008748.

Ethics declarations

Competing interests

The author declares no competing financial interests.

Related links

FURTHER INFORMATION

Isaac S. Kohane's homepage

Biobank Japan Project

Danish National Biobank

Database of Genotypes and Phenotypes (dbGaP)

eMERGE Network

Estonian Genome Center

Genome.gov DNA sequencing costs

i2b2

Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH)

Marshfield Clinic Personalized Medicine Research Project (PMRP)

UK Biobank

Vanderbilt BioVU

Glossary

Biorepository

A biological materials repository that collects, processes, stores and distributes biospecimens to support future scientific investigation.

Natural language processing (NLP)

A field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. NLP techniques allow the text in electronic medical records to be transformed from a clinical narrative to a set of codified terms or tags that are more readily subject to computational and statistical analysis.
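
As a rough illustration of turning narrative text into codified tags, the following minimal sketch matches phrases from a tiny vocabulary and flags simple negation. The vocabulary, the code labels (such as DX_RA) and the negation cue list are hypothetical placeholders chosen for this example. They are not drawn from any real terminology or from the NLP systems discussed in the article, which are considerably more sophisticated.

```python
import re

# Hypothetical mini-vocabulary: surface phrase -> illustrative code label.
# These labels are placeholders, not real terminology codes.
VOCABULARY = {
    "rheumatoid arthritis": "DX_RA",
    "myocardial infarction": "DX_MI",
    "current smoker": "SMOKING_CURRENT",
}

# Very small negation cue list, checked in the words just before a match.
NEGATION_CUES = ("no", "denies", "without")


def tag_note(note: str) -> list[tuple[str, bool]]:
    """Return (code, is_negated) pairs for vocabulary phrases found in a note."""
    text = note.lower()
    tags = []
    for phrase, code in VOCABULARY.items():
        for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            # Take up to three words before the phrase, within the same sentence.
            sentence_prefix = re.split(r"[.;]", text[:match.start()])[-1]
            preceding = sentence_prefix.split()[-3:]
            negated = any(word in NEGATION_CUES for word in preceding)
            tags.append((code, negated))
    return tags


example_note = (
    "Patient has rheumatoid arthritis. "
    "Denies myocardial infarction. Current smoker."
)
example_tags = tag_note(example_note)
```

A production pipeline would also need to handle misspellings, abbreviations, family history and uncertainty, none of which this sketch attempts.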

Biobank

A cryogenic storage facility used to archive biological samples for use in research and experiments. Ranging in size from individual refrigerators to warehouses, biobanks are maintained by institutions such as hospitals, universities, non-profit organizations, pharmaceutical companies and national biorepositories. More recently, the term biobank has been used to signify a population cohort study with stored biological samples.

Population stratification

The presence of a systematic difference in allele frequencies between subpopulations from a larger population, possibly owing to different ancestry, especially in the context of association studies. (Population stratification is also referred to as population structure in this context.) If not properly accounted for in association studies, population stratification can lead to spurious associations.
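
The way stratification produces a spurious association can be seen in a small worked example. The counts below are invented purely for illustration. Within each of two hypothetical subpopulations the variant is unrelated to disease (odds ratio of 1), yet pooling them yields an odds ratio well above 1 because allele frequency and disease rate both differ between the groups. The Mantel-Haenszel estimator is used here as one standard way to adjust for strata; the article does not prescribe a particular method.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio for a 2x2 table of carrier status by case/control status."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


def mantel_haenszel_or(strata):
    """Stratum-adjusted odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Invented counts: (case carriers, case non-carriers,
#                   control carriers, control non-carriers)
SUBPOP_A = (160, 40, 640, 160)   # carrier frequency 0.8, disease rate 0.20
SUBPOP_B = (10, 40, 190, 760)    # carrier frequency 0.2, disease rate 0.05

# Ignoring ancestry: simply add the two tables together.
POOLED = tuple(a + b for a, b in zip(SUBPOP_A, SUBPOP_B))

or_a = odds_ratio(*SUBPOP_A)
or_b = odds_ratio(*SUBPOP_B)
or_pooled = odds_ratio(*POOLED)
or_adjusted = mantel_haenszel_or([SUBPOP_A, SUBPOP_B])
```

Here the pooled odds ratio is about 2.36 although neither subpopulation shows any association, and the stratum-adjusted estimate returns to 1.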

Controlled vocabularies

A controlled vocabulary only includes terms that have been selected by the group that created the vocabulary. The goal of such a vocabulary is to standardize and simplify the organization of data and knowledge in a particular domain.

Phenome

The set of all phenotypes expressed by a cell, tissue, organ, organism or species.
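
The phenome-wide association study described in the Key Points scans one variant against every phenotype in the record. A minimal sketch of that loop follows, assuming each patient is represented as a carrier flag plus a set of phenotype codes. The code names, the 0.5 continuity correction (added so that empty cells do not cause division by zero) and the ranking by odds ratio are choices made for this illustration. A real analysis would typically use regression with covariates and correct for the number of phenotypes tested.

```python
from collections import Counter


def phewas_odds_ratios(patients):
    """Odds ratio of one variant for every phenotype seen in the cohort.

    patients: list of (is_carrier, set_of_phenotype_codes) pairs.
    Returns a dict of code -> odds ratio, ordered from largest to smallest.
    """
    n_carriers = sum(1 for carrier, _ in patients if carrier)
    n_noncarriers = len(patients) - n_carriers

    carrier_counts, noncarrier_counts = Counter(), Counter()
    for carrier, phenotypes in patients:
        (carrier_counts if carrier else noncarrier_counts).update(phenotypes)

    results = {}
    for code in set(carrier_counts) | set(noncarrier_counts):
        a = carrier_counts[code] + 0.5                      # carriers with phenotype
        b = n_carriers - carrier_counts[code] + 0.5         # carriers without
        c = noncarrier_counts[code] + 0.5                   # non-carriers with
        d = n_noncarriers - noncarrier_counts[code] + 0.5   # non-carriers without
        results[code] = (a * d) / (b * c)
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))


# Invented toy cohort: four carriers followed by four non-carriers.
toy_cohort = [
    (True, {"PHENO_X"}), (True, {"PHENO_X"}), (True, {"PHENO_X", "PHENO_Y"}), (True, set()),
    (False, {"PHENO_X"}), (False, {"PHENO_Y"}), (False, set()), (False, set()),
]
toy_results = phewas_odds_ratios(toy_cohort)
```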

Datamart

The entire stored data of an enterprise (for example, a health-care centre) is often termed the data warehouse. For a specified purpose (for example, a disease-specific study), a subset of the data warehouse, called the datamart, is extracted for a group of analysts.
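
The relationship between warehouse and datamart can be sketched as a two-step filter: first select the patients who qualify for the study, then keep only the kinds of facts the analysts need. The Fact record, the build_datamart function and all codes below are hypothetical names introduced for this example. They do not reflect the schema of i2b2 or of any other system named in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One observation in the (hypothetical) enterprise data warehouse."""
    patient_id: str
    code: str    # illustrative diagnosis, laboratory or medication code
    value: str


def build_datamart(warehouse, cohort_codes, keep_codes):
    """Extract a study-specific subset of the warehouse.

    cohort_codes: a patient joins the cohort if any of their facts has one of these codes.
    keep_codes:   only facts with these codes are copied into the datamart.
    """
    cohort = {fact.patient_id for fact in warehouse if fact.code in cohort_codes}
    return [
        fact for fact in warehouse
        if fact.patient_id in cohort and fact.code in keep_codes
    ]


# Invented warehouse contents for a disease-specific study.
warehouse = [
    Fact("p1", "DX_RA", "1"),
    Fact("p1", "LAB_CRP", "12"),
    Fact("p1", "MED_X", "yes"),
    Fact("p2", "LAB_CRP", "3"),
    Fact("p3", "DX_RA", "1"),
    Fact("p3", "LAB_CRP", "8"),
]
datamart = build_datamart(warehouse, {"DX_RA"}, {"DX_RA", "LAB_CRP"})
```

In this toy case the datamart holds the diagnosis and laboratory facts for patients p1 and p3 only. Patient p2 lacks the qualifying diagnosis, and the medication fact is outside the requested codes.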

Metagenomics

The study of metagenomes, which consist of genetic material recovered directly from environmental samples. Increasingly, it is used to describe the shotgun sequencing and analysis of the microbial genomes found in the milieu of the human body and its waste products.

About this article

Cite this article

Kohane, I. Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12, 417–428 (2011). https://doi.org/10.1038/nrg2999
