Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augmenting diagnostic evaluations, and providing clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal.

Data availability

We have made available the Jupyter notebook that we used in constructing and validating the hierarchical logistic regression models: https://s3.cn-north-1.amazonaws.com.cn/ped.emr/Data/hierachical_logistic_regression.ipynb. To protect patient confidentiality, we have deposited de-identified aggregated patient data in a secured, confidentiality-compliant cloud in China in accordance with data security regulations. Data access can be requested by writing to the corresponding authors. All data access requests will be reviewed and (if successful) granted by the Data Access Committee.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



This study was funded by the National Key Research and Development Program of China (2017YFC1104600 to H.L.), National Natural Science Foundation of China (81771629 to H.X. and 81700882 to J.X.), Guangzhou Women and Children’s Medical Center, Guangzhou Regenerative Medicine and Health Guangdong Laboratory (Innovation and Startup Talents Program 2018GZR031001 to L.Z. and R.H.).

Author information

Author notes

  1. These authors contributed equally: Huiying Liang, Brian Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, Guangjian Liu.


  1. Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou, China

    Huiying Liang, Guangjian Liu, Daniel S. Kermany, Xin Sun, Liya He, Jie Zhu, Sierra Hewett, Gen Li, Liyan Pan, Rujuan Ling, Shuhua Li, Yongwang Cui, Shusheng Tang, Hong Ye, Xiaoyan Huang, Waner He, Wenqing Liang, Qing Zhang, Jianmin Jiang, Wei Yu, Jianqun Gao, Wanxing Ou, Yingmin Deng, Qiaozhen Hou, Bei Wang, Cuichan Yao, Yan Liang, Shu Zhang, Xiaokang Wu, Jing Li, Daoman Xiang, Wanting He, Yugui Zhou, Kang Zhang & Huimin Xia

  2. Institute for Genomic Medicine, Institute of Engineering in Medicine, and Shiley Eye Institute, University of California, San Diego, La Jolla, CA, USA

    Brian Y. Tsui, Sally L. Baxter, Wenjia Cai, Daniel S. Kermany, Jiancong Chen, Pin Tian, Hua Shao, Sierra Hewett, Gen Li, Yaou Duan, Runze Zhang, Sarah Gibson, Charlotte L. Zhang, Oulan Li, Edward D. Zhang, Gabriel Karin, Nathan Nguyen, Xiaokang Wu, Cindy Wen, Jie Xu, Wenqin Xu, Bochu Wang, Winston Wang, Jing Li, Bianca Pizzato, Caroline Bao, Wanting He, Suiqin He, Yugui Zhou, Weldon Haw, Michael Goldbaum, Adriana Tremoulet, Chun-Nan Hsu, Hannah Carter & Kang Zhang

  3. Hangzhou YITU Healthcare Technology Co. Ltd, Hangzhou, China

    Hao Ni, Ping Liang, Xuan Zang, Zhiqi Zhang & Long Zhu

  4. Department of Thoracic Surgery/Oncology, First Affiliated Hospital of Guangzhou Medical University, China State Key Laboratory and National Clinical Research Center for Respiratory Disease, Guangzhou, China

    Carolina C. S. Valentim

  5. Guangzhou Kangrui Co. Ltd, Guangzhou, China

    Lianghong Zheng, Rui Hou & Huimin Cai

  6. Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou, China

    Lianghong Zheng, Rui Hou & Huimin Cai

  7. Veterans Administration Healthcare System, San Diego, CA, USA

    Weldon Haw & Kang Zhang



H.L., B.T., H.N., W.C., S.L.B., G. Liu, D.S.K., X.S., C.C.S.V., P.T., H.S., J.C., L.H., J.Z., L.Z., R.H., S.H., G. Li, P.L., X.Z., Z.Z., L.P., H.C., R.L., S.L., Y.C., S.T., H.Y., X.H., W. He, W.L., Q.Z., J.J., W.Y., J.G., W.O., Y. Deng, Q.H., B. Wang, C.Y., Y.L., S.Z., Y. Duan, R.Z., S.G., C.L.Z., O.L., E.D.Z., G.K., X.W., C.W., N.N., J.X., W.X., B. Wang, W.W., J.L., B.P., C.B., D.X., W. He, S.H., Y.Z., W. Haw, M.G., A.T., C.-N.H., H.C., L.Z., H.X. and K.Z. collected and analyzed the data. X.H. and K.Z. conceived the project. K.Z., S.L.B., B.T., H.L., and H.X. wrote the manuscript. All authors discussed the results and reviewed the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Kang Zhang or Huimin Xia.

Extended data

  1. Extended Data 1 Unsupervised clustering of NLP extracted textual features from pediatric diseases.

    The diagnostic system analyzed the EHRs in the absence of a defined classification system. This grouping structure reflects the detection of trends in clinical features without pre-defined labeling or human input. Clustered blocks are outlined with grey boxes.

  2. Extended Data 2 Design of the natural language processing information extraction model.

    Segmented sentences from the raw text of the EHR were embedded using word2vec. The LSTM model then generated the structured records in a query–answer format. This schematic illustrates the process using the free-text ‘lesion in the upper left lobe of patient’s lung’ as an example.
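
The query–answer output format described for Extended Data 2 can be illustrated with a small sketch. This is not the authors' model: the word2vec embedding and the LSTM are replaced here by a plain keyword lookup, and the schema (`SCHEMA`), the record type (`QueryAnswer`), and the function names are all hypothetical, chosen only to show how a free-text sentence might be turned into structured query–answer records.

```python
import re
from dataclasses import dataclass

# Hypothetical schema: each query lists the answers it can take.
# The actual schema and vocabulary are not given in this section.
SCHEMA = {
    "finding": ["lesion", "nodule", "effusion"],
    "organ": ["lung", "liver", "kidney"],
    "location": ["upper left lobe", "lower left lobe",
                 "upper right lobe", "lower right lobe"],
}


@dataclass(frozen=True)
class QueryAnswer:
    query: str
    answer: str  # empty string when no term matched


def segment(text: str) -> list[str]:
    """Split raw free text into lower-cased sentences."""
    parts = re.split(r"[.;!?]+", text)
    return [p.strip().lower() for p in parts if p.strip()]


def extract(sentence: str) -> list[QueryAnswer]:
    """Keyword lookup standing in for the learned (LSTM) extractor."""
    records = []
    for query, vocabulary in SCHEMA.items():
        answer = next((term for term in vocabulary if term in sentence), "")
        records.append(QueryAnswer(query, answer))
    return records


def structure(text: str) -> list[QueryAnswer]:
    """Segment the text, then emit one record per query per sentence."""
    return [qa for sentence in segment(text) for qa in extract(sentence)]


if __name__ == "__main__":
    for qa in structure("Lesion in the upper left lobe of patient's lung."):
        print(f"{qa.query}: {qa.answer}")
```

In a learned system the lookup in `extract` would be replaced by a model that scores candidate answers from the embedded sentence; the record structure is the only part this sketch is meant to convey.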

Supplementary Information

  1. Reporting Summary

  2. Supplementary Tables

    Supplementary Tables 1–9
