Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augment diagnostic evaluations, and provide clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal.

Data availability

We have made available the Jupyter notebook that we used in constructing and validating the hierarchical logistic regression models: https://s3.cn-north-1.amazonaws.com.cn/ped.emr/Data/hierachical_logistic_regression.ipynb. To protect patient confidentiality, we have deposited de-identified aggregated patient data in a secured, confidentiality-compliant cloud in China in accordance with data security regulations. Data access can be requested by writing to the corresponding authors. All data access requests will be reviewed and (if successful) granted by the Data Access Committee.

Additional information

This study was funded by the National Key Research and Development Program of China (2017YFC1104600 to H.L.), the National Natural Science Foundation of China (81771629 to H.X. and 81700882 to J.X.), Guangzhou Women and Children’s Medical Center, and Guangzhou Regenerative Medicine and Health Guangdong Laboratory (Innovation and Startup Talents Program 2018GZR031001 to L.Z. and R.H.).

Author information

Author notes

  1. These authors contributed equally: Huiying Liang, Brian Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, Guangjian Liu.

Affiliations


  1. Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou, China

    Huiying Liang, Guangjian Liu, Daniel S. Kermany, Xin Sun, Liya He, Jie Zhu, Sierra Hewett, Gen Li, Liyan Pan, Rujuan Ling, Shuhua Li, Yongwang Cui, Shusheng Tang, Hong Ye, Xiaoyan Huang, Waner He, Wenqing Liang, Qing Zhang, Jianmin Jiang, Wei Yu, Jianqun Gao, Wanxing Ou, Yingmin Deng, Qiaozhen Hou, Bei Wang, Cuichan Yao, Yan Liang, Shu Zhang, Xiaokang Wu, Jing Li, Daoman Xiang, Wanting He, Yugui Zhou, Kang Zhang & Huimin Xia
  2. Institute for Genomic Medicine, Institute of Engineering in Medicine, and Shiley Eye Institute, University of California, San Diego, La Jolla, CA, USA

    Brian Y. Tsui, Sally L. Baxter, Wenjia Cai, Daniel S. Kermany, Jiancong Chen, Pin Tian, Hua Shao, Sierra Hewett, Gen Li, Yaou Duan, Runze Zhang, Sarah Gibson, Charlotte L. Zhang, Oulan Li, Edward D. Zhang, Gabriel Karin, Nathan Nguyen, Xiaokang Wu, Cindy Wen, Jie Xu, Wenqin Xu, Bochu Wang, Winston Wang, Jing Li, Bianca Pizzato, Caroline Bao, Wanting He, Suiqin He, Yugui Zhou, Weldon Haw, Michael Goldbaum, Adriana Tremoulet, Chun-Nan Hsu, Hannah Carter & Kang Zhang
  3. Hangzhou YITU Healthcare Technology Co. Ltd, Hangzhou, China

    Hao Ni, Ping Liang, Xuan Zang, Zhiqi Zhang & Long Zhu
  4. Department of Thoracic Surgery/Oncology, First Affiliated Hospital of Guangzhou Medical University, China State Key Laboratory and National Clinical Research Center for Respiratory Disease, Guangzhou, China

    Carolina C. S. Valentim
  5. Guangzhou Kangrui Co. Ltd, Guangzhou, China

    Lianghong Zheng, Rui Hou & Huimin Cai
  6. Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou, China

    Lianghong Zheng, Rui Hou & Huimin Cai
  7. Veterans Administration Healthcare System, San Diego, CA, USA

    Weldon Haw & Kang Zhang

Contributions


H.L., B.T., H.N., W.C., S.L.B., G. Liu, D.S.K., X.S., C.C.S.V., P.T., H.S., J.C., L.H., J.Z., L.Z., R.H., S.H., G. Li, P.L., X.Z., Z.Z., L.P., H.C., R.L., S.L., Y.C., S.T., H.Y., X.H., W. He, W.L., Q.Z., J.J., W.Y., J.G., W.O., Y. Deng, Q.H., B. Wang, C.Y., Y.L., S.Z., Y. Duan, R.Z., S.G., C.L.Z., O.L., E.D.Z., G.K., X.W., C.W., N.N., J.X., W.X., B. Wang, W.W., J.L., B.P., C.B., D.X., W. He, S.H., Y.Z., W. Haw, M.G., A.T., C.-N.H., H.C., L.Z., H.X. and K.Z. collected and analyzed the data. X.H. and K.Z. conceived the project. K.Z., S.L.B., B.T., H.L., and H.X. wrote the manuscript. All authors discussed the results and reviewed the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Kang Zhang or Huimin Xia.

Extended data

  1. Extended Data 1 Unsupervised clustering of NLP-extracted textual features from pediatric diseases.

    The diagnostic system analyzed the EHRs in the absence of a defined classification system. This grouping structure reflects the detection of trends in clinical features without pre-defined labeling or human input. The clustered blocks are outlined with grey boxes.

  2. Extended Data 2 Design of the natural language processing information extraction model.

    Segmented sentences from the raw text of the EHR were embedded using word2vec. The LSTM model then generated the structured records in a query–answer format. This schematic illustrates the process using the free-text ‘lesion in the upper left lobe of patient’s lung’ as an example.

Supplementary Information

  1. Reporting Summary

  2. Supplementary Tables

    Supplementary Tables 1–9

