High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Zhang, Yichi; Cai, Tianrun; Yu, Sheng; Cho, Kelly; Hong, Chuan; Sun, Jiehuan; Huang, Jie; Ho, Yuk-Lam; Ananthakrishnan, Ashwin N.; Xia, Zongqi; Shaw, Stanley Y.; Gainer, Vivian; Castro, Victor; Link, Nicholas; Honerlaw, Jacqueline; Huang, Sicong; Gagnon, David; Karlson, Elizabeth W.; Plenge, Robert M.; Szolovits, Peter; Savova, Guergana; Churchill, Susanne; O’Donnell, Christopher; Murphy, Shawn N.; Gaziano, J. Michael; Kohane, Isaac; Cai, Tianxi; Liao, Katherine P.

doi:10.1038/s41596-019-0227-6

Protocol
Published: 20 November 2019

Yichi Zhang (1; contributed equally),
Tianrun Cai (2; contributed equally),
Sheng Yu (3, 4; contributed equally),
Kelly Cho (5, 6),
Chuan Hong (1),
Jiehuan Sun (1),
Jie Huang (2),
Yuk-Lam Ho (5; ORCID: orcid.org/0000-0003-3305-3830),
Ashwin N. Ananthakrishnan (7),
Zongqi Xia (8; ORCID: orcid.org/0000-0003-1500-2589),
Stanley Y. Shaw (9),
Vivian Gainer (10),
Victor Castro (10),
Nicholas Link (5),
Jacqueline Honerlaw (5),
Sicong Huang (2),
David Gagnon (5; present address: see Author information),
Elizabeth W. Karlson (2),
Robert M. Plenge (2),
Peter Szolovits (11),
Guergana Savova (12),
Susanne Churchill (13),
Christopher O’Donnell (5, 14),
Shawn N. Murphy (10, 13, 15),
J. Michael Gaziano (5, 6),
Isaac Kohane (13),
Tianxi Cai (1, 13; jointly supervised) &
Katherine P. Liao (2, 5, 13; jointly supervised; ORCID: orcid.org/0000-0002-4797-3200)

Nature Protocols volume 14, pages 3426–3444 (2019)

Abstract

Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1–2 days if all data are available; however, the overall timing depends largely on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).
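To make the three final products concrete, the following is a minimal sketch in Python, not the PheCAP R package or its API. It assumes two hypothetical features per patient (a log-transformed diagnosis code count and a log-transformed NLP mention count), simulated data, and a small chart-reviewed subset with gold standard labels. Plain L2-penalized logistic regression stands in for the algorithm-training step, and the unsupervised feature learning step is omitted entirely. All names, hyperparameters, and the 0.5 threshold are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, l2=0.01, lr=0.1, epochs=2000):
    """Fit L2-penalized logistic regression by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        # The penalty is applied to the weights only, not the intercept.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(X, w, b):
    # Probability of the phenotype for every patient.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def classify(probs, threshold=0.5):
    # Yes/no phenotype classification from the probabilities.
    return [p >= threshold for p in probs]


def simulate_patients(n, seed=0):
    # Simulated cohort; the feature distributions are made up for illustration.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        case = rng.random() < 0.3
        code_count = max(rng.gauss(2.0 if case else 0.5, 0.5), 0.0)
        nlp_count = max(rng.gauss(2.5 if case else 0.7, 0.6), 0.0)
        X.append([code_count, nlp_count])
        y.append(1 if case else 0)
    return X, y


X_all, y_true = simulate_patients(500)
labeled_idx = list(range(100))  # pretend chart review labeled these patients
w, b = fit_logistic([X_all[i] for i in labeled_idx],
                    [y_true[i] for i in labeled_idx])
probs = predict_proba(X_all, w, b)         # product 2: probabilities
labels = classify(probs, threshold=0.5)    # product 3: yes/no classification
```

Here the fitted pair `(w, b)` plays the role of the phenotype algorithm, `probs` holds a probability for every patient including the unlabeled ones, and `labels` is the yes/no call. In practice the threshold would be chosen against the gold standard labels rather than fixed at 0.5.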

Fig. 2: Creating an NLP dictionary.

Fig. 3: Unsupervised feature learning.

Fig. 4: Detailed flow of PheCAP protocol.

Fig. 7: Algorithm-training step output.

Data availability

The datasets generated or analyzed in this protocol can be downloaded from https://celehs.github.io/PheCAP/.

Code availability

The R package and code referenced in this protocol can be downloaded from https://celehs.github.io/PheCAP/.

Acknowledgements

We thank R. Eastwood for her assistance with figure design and Z. He for establishing the PheCAP website. We gratefully acknowledge support for this project from the NIH (P30 AR 072577; Tianrun C., C.H., J.H., Tianxi C., and K.P.L.) and a VA Office of Research and Development VA Merit Award (I01-CX001025; K.C., Y.L.H., J.H., D.G., C.O., J.M.G.); past support for i2b2 from the NIH (U54 LM008748; A.N.A., Z.X., S.Y.S., V.G., V.C., E.W.K., R.M.P., P.S., G.S., S.C., S.N.M., I.K., Tianxi C., and K.P.L.) and support from grant R01 HG009174 (V.G., V.C., and S.N.M.). A.N.A. received support from the Crohn’s and Colitis Foundation, the NIH, and Pfizer. Z.X. received support from the NIH (NINDS098023). S.H. received support from grant T32 AR 007530. K.P.L. received support from the Harold and DuVal Bowen Fund.

Author information

David Gagnon
Present address: Department of Biostatistics, Boston University, Boston, MA, USA
These authors jointly supervised the work: Tianxi Cai, Katherine P. Liao.
These authors contributed equally: Yichi Zhang, Tianrun Cai, Sheng Yu.

Authors and Affiliations

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Yichi Zhang, Chuan Hong, Jiehuan Sun & Tianxi Cai
Division of Rheumatology, Immunology, and Allergy, Brigham and Women’s Hospital, Boston, MA, USA
Tianrun Cai, Jie Huang, Sicong Huang, Elizabeth W. Karlson, Robert M. Plenge & Katherine P. Liao
Center for Statistical Science, Tsinghua University, Beijing, China
Sheng Yu
Department of Industrial Engineering, Tsinghua University, Beijing, China
Sheng Yu
Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
Kelly Cho, Yuk-Lam Ho, Nicholas Link, Jacqueline Honerlaw, David Gagnon, Christopher O’Donnell, J. Michael Gaziano & Katherine P. Liao
Division of Aging, Brigham and Women’s Hospital, Boston, MA, USA
Kelly Cho & J. Michael Gaziano
Department of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA
Ashwin N. Ananthakrishnan
Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
Zongqi Xia
Division of Cardiovascular Medicine, Brigham and Women’s Hospital, Boston, MA, USA
Stanley Y. Shaw
Research Information Science and Computing, Partners Healthcare, Boston, MA, USA
Vivian Gainer, Victor Castro & Shawn N. Murphy
Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
Peter Szolovits
Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA, USA
Guergana Savova
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Susanne Churchill, Shawn N. Murphy, Isaac Kohane, Tianxi Cai & Katherine P. Liao
Division of Cardiology, VA Boston Healthcare System, Boston, MA, USA
Christopher O’Donnell
Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
Shawn N. Murphy

Contributions

Y.Z., Tianrun C., S.Y., C.H., J.S., A.N.A., Z.X., S.Y.S., V.G., V.C., N.L., E.W.K., R.M.P., P.S., G.S., S.C., S.N.M., I.K., Tianxi C., and K.P.L. contributed to the development of the pipeline; Y.Z., Tianrun C., S.Y., C.H., J.S., N.L., and Tianxi C. contributed to the development of the R package and software used in this protocol; Y.Z., Tianrun C., K.C., C.H., J.S., J. Huang, Y.-L.H., A.N.A., Z.X., S.Y.S., V.G., V.C., N.L., J. Honerlaw, S.H., D.G., P.S., G.S., S.C., C.O., S.N.M., J.M.G., I.K., Tianxi C., and K.P.L. contributed to the validation of and enhancements to the pipeline; Y.Z., Tianrun C., S.Y., C.H., J.S., V.G., V.C., G.S., Tianxi C., and K.P.L. drafted the manuscript; all authors contributed to revisions and proofreading of the manuscript.

Corresponding author

Correspondence to Katherine P. Liao.

Ethics declarations

Competing interests

R.M.P. is employed at Celgene; however, his contributions to the protocol were performed while at Brigham and Women’s Hospital. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Protocols thanks Juan Banda and other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, Y., Cai, T., Yu, S. et al. High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nat Protoc 14, 3426–3444 (2019). https://doi.org/10.1038/s41596-019-0227-6

Received: 19 October 2018
Accepted: 22 July 2019
Published: 20 November 2019
Issue Date: December 2019
DOI: https://doi.org/10.1038/s41596-019-0227-6
