Abstract
Large language models like ChatGPT have recently received a great deal of attention. One area of interest is how these models could be used in biomedical contexts, including in relation to human genetics. To assess one facet of this, we compared the performance of ChatGPT with that of human respondents (13,642 human responses) in answering 85 multiple-choice questions about aspects of human genetics. Overall, ChatGPT did not perform significantly differently from human respondents (p = 0.8327); ChatGPT was 68.2% accurate, compared to 66.6% accuracy for human respondents. Both ChatGPT and humans performed better on memorization-type questions than on critical-thinking questions (p < 0.0001). When asked the same question multiple times, ChatGPT frequently provided different answers (for 16% of initial responses), including for questions it initially answered both correctly and incorrectly, and it gave plausible explanations for both correct and incorrect answers. ChatGPT’s performance was impressive, but it currently demonstrates significant shortcomings for clinical or other high-stakes use. Addressing these limitations will be important to guide adoption in real-life situations.
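As context for the headline numbers, the comparison of ChatGPT’s 68.2% accuracy (85 questions) with the human respondents’ 66.6% accuracy (13,642 responses) can be approximated with a standard test of two proportions. The sketch below is illustrative only: the correct-answer counts are back-calculated from the reported percentages, and it is not the authors’ analysis code, nor necessarily the exact test used in the paper.

```python
# Illustrative only: counts are inferred from the abstract's percentages;
# this is not the authors' analysis code, and the exact test they used may differ.
from scipy.stats import chi2_contingency

chatgpt_correct, chatgpt_total = 58, 85      # ~68.2% accuracy on 85 questions
human_correct, human_total = 9086, 13642     # ~66.6% accuracy over 13,642 responses

table = [
    [chatgpt_correct, chatgpt_total - chatgpt_correct],
    [human_correct, human_total - human_correct],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
# A large p-value here is consistent with the reported lack of a significant
# difference between ChatGPT and human respondents.
```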
Data availability
All data used and presented are available in the paper and supplementary files.
References
1. Ledgister Hanchard SE, Dwyer MC, Liu S, Hu P, Tekendo-Ngongang C, Waikel RL, et al. Scoping review and classification of deep learning in medical genetics. Genet Med. 2022;24:1593–603.
2. Schaefer J, Lehne M, Schepers J, Prasser F, Thun S. The use of machine learning in rare diseases: a scoping review. Orphanet J Rare Dis. 2020;15:145.
3. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019;11:70.
4. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138. 2022.
5. Shelmerdine SC, Martin H, Shirodkar K, Shamshuddin S, Weir-McCall JR, FRCR-AI Study Collaborators. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ. 2022;379:e072826.
6. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194.
7. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
8. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176:535–48.e24.
9. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36:983–7.
10. DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell. 2021;3:610–9.
11. Tekendo-Ngongang C, Owosela B, Fleischer N, Addissie YA, Malonga B, Badoe E, et al. Rubinstein-Taybi syndrome in diverse populations. Am J Med Genet A. 2020;182:2939–50.
12. Solomon BD. Medical Genetics and Genomics: Questions for Board Review. Hoboken: Wiley; 2022.
Funding
This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.
Author information
Authors and Affiliations
Contributions
DD contributed to formal analysis, investigation, methodology, and writing (review and editing). BDS contributed to conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, and writing (original draft).
Corresponding author
Ethics declarations
Competing interests
The authors receive salary and research support from the intramural program of the National Human Genome Research Institute. BDS is the co-Editor-in-Chief of the American Journal of Medical Genetics, and has published some of the questions mentioned in this study in a book, as well as other questions [12]. Both editing/publishing activities are conducted as an approved outside activity, separate from his US Government role.
Ethics approval
No individual data were collected or analyzed (there was no access to individual respondent data); per discussion with NIH bioethics/IRB, the analyses described here are considered “not human subjects research” and do not require IRB review or formal exemption.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
About this article
Cite this article
Duong, D., Solomon, B.D. Analysis of large-language model versus human performance for genetics questions. Eur J Hum Genet (2023). https://doi.org/10.1038/s41431-023-01396-8
DOI: https://doi.org/10.1038/s41431-023-01396-8