Diagnosing monogenic diseases facilitates optimal care, but can involve the manual evaluation of hundreds of genetic variants per case. Computational tools like Phrank expedite this process by ranking all candidate genes by their ability to explain the patient’s phenotypes. To use these tools, busy clinicians must manually encode patient phenotypes from lengthy clinical notes. With 100 million human genomes estimated to be sequenced by 2025, a fast alternative to manual phenotype extraction from clinical notes will become necessary.


We introduce ClinPhen, a fast, high-accuracy tool that automatically converts clinical notes into a prioritized list of patient phenotypes using Human Phenotype Ontology (HPO) terms.


ClinPhen shows superior accuracy and 20× speedup over existing phenotype extractors, and its novel phenotype prioritization scheme improves the performance of gene-ranking tools.


While a dedicated clinician can process 200 patient records in a 40-hour workweek, ClinPhen does the same in 10 minutes. Compared with manual phenotype extraction, ClinPhen saves an additional 3–5 hours per Mendelian disease diagnosis. Providers can now add ClinPhen’s output to each summary note attached to a filled testing laboratory request form. ClinPhen makes a substantial contribution to improvements in efficiency critically needed to meet the surging demand for clinical diagnostic sequencing.
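The throughput comparison above can be made concrete with a short back-of-the-envelope calculation. The sketch below uses only the three figures quoted in the text (200 records, a 40-hour workweek, 10 minutes); the variable and function names are illustrative. The resulting ratio compares ClinPhen with a human reader and is separate from the 20× speedup reported over existing phenotype extractors.

```python
# Figures quoted in the text; names are illustrative only.
RECORDS = 200            # patient records processed
CLINICIAN_HOURS = 40     # one clinician workweek
CLINPHEN_MINUTES = 10    # ClinPhen wall time for the same records


def per_record_seconds(total_minutes: float, records: int) -> float:
    """Average processing time per record, in seconds."""
    return total_minutes * 60 / records


clinician_s = per_record_seconds(CLINICIAN_HOURS * 60, RECORDS)
clinphen_s = per_record_seconds(CLINPHEN_MINUTES, RECORDS)
throughput_ratio = clinician_s / clinphen_s

print(f"Clinician: {clinician_s / 60:.0f} min per record")
print(f"ClinPhen:  {clinphen_s:.0f} s per record")
print(f"Ratio:     {throughput_ratio:.0f}x")
```

Under these figures, manual review averages about 12 minutes per record, whereas ClinPhen averages about 3 seconds.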

We thank Julia Buckingham and Morgan Danowski for assistance with obtaining patient data; Paul McDonagh and Margaret Bray for introductions and data sharing; Charlie Curnin, Marta Maria Majcherska, and Colleen McCormack for facilitating access to patient data; and Bejerano Lab members and Elijah Kravets for project feedback. Clinicians’ research was supported by the National Institutes of Health (NIH) Common Fund, Office of Strategic Coordination/Office of the NIH Director Awards U01HG007690, U01HG007708, U01HG007530, and U01HG007942. Manton Center sequence analysis and diagnosis were supported by NIH 1U54HD090255 IDDRC Molecular Genetics Core grant. The Duke UDN site is funded by NIH grant U01HG007672 (principal investigators: V. Shashi and D.B. Goldstein). UCLA’s J.A.M.-A. and R.S. were supported by UDN grant HG007703-05. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. All computational tool building was supported by a Stanford Bio-X Undergraduate Summer Research Program (C.A.D.), a Bio-X Stanford Interdisciplinary Graduate Fellowship (J.B.), and by the Defense Advanced Research Projects Agency (DARPA) and the Stanford Pediatrics Department (G.B.).

  1. Undiagnosed Diseases Network


The authors declare no conflicts of interest.

