Large language models (LLMs) — a type of artificial intelligence that uses deep learning and large data sets for natural language processing tasks — are being increasingly deployed in a variety of applications. However, applying LLMs in medicine and health care — for example, for triaging patient concerns, as a clinical decision assistant for doctors, or as a biomedical research assistant for scientists — remains challenging owing to the societal and medical implications of potential hallucinations by these models. One way to assess the reliability of LLMs and the knowledge they encode is to test their answers to biomedical questions; however, current medical question-answering benchmarks are limited in scope and have typically only considered small language models (a few hundred million to a few billion parameters). Now, writing in Nature, Karan Singhal, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan and team report a multidimensional question-answering benchmark, which they use to evaluate the clinical knowledge of fine-tuned variants of the pathways language model (PaLM), a 540-billion-parameter, densely activated LLM.
Based on this framework, the team designed a version of PaLM trained to follow instructions (instruction-tuned), named Flan-PaLM, which substantially outperformed existing state-of-the-art baseline LLMs, with 67.6% accuracy on MedQA, 57.6% on MedMCQA and 79.0% on PubMedQA. Nonetheless, only 61.9% of Flan-PaLM long-form answers were deemed to be aligned with scientific consensus, and 29.7% were rated potentially harmful. Applying instruction prompt tuning, a parameter-efficient alignment technique based on medical domain data and expert clinician demonstrations, produced a further model, called Med-PaLM, which improved these readouts to 92.6% (alignment with scientific consensus) and 5.9% (potentially harmful). “Importantly, Med-PaLM not only matched the performance of Flan-PaLM on benchmarks, such as USMLE, but also greatly improved on axes such as factuality of answers, harm, helpfulness and bias, thereby closing the gap with physicians”, says Natarajan.