
Large language models in medicine


Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners.


Fig. 1: LLMs developed in recent years.
Fig. 2: Fine-tuning an LLM (GPT-3.5) to develop an LLM chatbot (ChatGPT).
Fig. 3: Limitations, priorities for research and development and potential use-cases of LLM applications.





D.S.W.T. is supported by the National Medical Research Council, Singapore (NMCR/HSRG/0087/2018, MOH-000655-00 and MOH-001014-00), the Duke-NUS Medical School (Duke-NUS/RSF/2021/0018 and 05/FY2020/EX/15-A58) and the Agency for Science, Technology and Research (A20H4g2141 and H20C6a0032). These funders were not involved in the conception, execution or reporting of this review.

Author information


Corresponding author

Correspondence to Daniel Shu Wei Ting.

Ethics declarations

Competing interests

D.S.W.T. holds a patent on a deep learning system for the detection of retinal diseases. The other authors declare no conflicts of interest.

Peer review

Peer review information

Nature Medicine thanks Melissa McCradden, Pranav Rajpurkar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary handling editor: Karen O’Leary, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K. et al. Large language models in medicine. Nat Med 29, 1930–1940 (2023).

