Introduction
Large language models (LLMs) are complex predictive text algorithms that generate sentences and paragraphs from a prior sequence of words [1]. With appropriate training, they can act conversationally by generating text that follows a user-provided instruction or question. Well-known modern LLMs include OpenAI’s GPT-3 and GPT-4, Google’s PaLM-2, and Meta’s Llama-2. These models vary in design and performance. Information on accessing these models is included in the supplementary information.
Each implementation of the same LLM (e.g. GPT-4) may behave differently due to differences in ‘fine-tuning’ training, plugin access [2], and initial instructions [3]. Early evidence suggests that, when answering ophthalmology examination questions, GPT-4 accessed via the ChatGPT or New Bing chatbot is the most performant [2, 4,5,6]. Understanding how to employ LLMs effectively may have significant utility for clinical education and, in future, clinical practice.
What can LLMs do?
LLMs may be effective in language tasks such as writing, editing, and summarizing. For querying, LLMs excel at answering questions that are not easily web-searchable, and at fact-finding where no searchable keywords exist. LLMs can also be asked to write code in R and other languages. However, LLMs often perform mathematical operations inaccurately [7]. Plugins such as embedded calculators (Supplementary Information) can address specific LLM limitations.
Engaging LLMs
LLMs are engaged through text input. Slight variations in input text can cause significant variations in output [8]. Consequently, high-quality prompting, or ‘prompt engineering’, is critical in optimizing LLM outputs.
Asking the right question
It is widely accepted that providing specific instructions to LLMs yields more useful outputs [9]. For example, requesting a summary of a condition’s epidemiology, pathophysiology, diagnosis, and treatment is generally superior to asking for general information (Box 1). Additionally, using specifiers such as “Provide a brief/detailed summary of”, “Explain at a high-school/college-grade/consultant level”, and “Present your findings in a table/as dot points” sets specific parameters to allow LLMs to generate more useful outputs.
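As a sketch of how these specifiers might be composed programmatically (the function, its parameters, and the template wording are illustrative, not drawn from the article):

```python
def build_clinical_prompt(condition, detail="brief",
                          audience="college-grade", layout="dot points"):
    """Compose a query using the specifier patterns above.

    All names and template wording here are illustrative; adjust the
    phrasing to suit the task at hand.
    """
    return (
        f"Provide a {detail} summary of the epidemiology, pathophysiology, "
        f"diagnosis, and treatment of {condition}. "
        f"Explain at a {audience} level. "
        f"Present your findings as {layout}."
    )

# A specific request is generally superior to "Tell me about glaucoma":
prompt = build_clinical_prompt("primary open-angle glaucoma", detail="detailed")
```

Each parameter sets one of the output dimensions discussed above (depth, reading level, layout), so the same template can serve many conditions and audiences.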
Preparing the LLM
‘Priming’ LLMs for expected inputs and responses is an emerging practice [2]. For example: “I am going to provide X; I want you to output Y”. This strategy increases the chance of a useful output. Example outputs can also be offered to LLMs to style-mimic. Online libraries of example ‘guiding prompts’ are emerging (for example https://www.learnprompt.org/act-as-chat-gpt-prompts/).
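The priming pattern quoted above can be captured in a small template helper. This is a sketch; the wording and the optional style-example mechanism are illustrative assumptions, not a fixed recipe:

```python
def priming_prompt(input_desc, output_desc, style_example=None):
    # The "I am going to provide X; I want you to output Y" priming pattern
    prompt = (f"I am going to provide {input_desc}; "
              f"I want you to output {output_desc}.")
    if style_example is not None:
        # Optionally offer an example output for the model to style-mimic
        prompt += ("\n\nHere is an example of the output style I want:\n"
                   + style_example)
    return prompt

p = priming_prompt("a patient history", "a one-paragraph referral summary")
```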
Notably, certain instructions could worsen output quality. ChatGPT’s outputs are most accurate when it explains its reasoning [10]. Therefore, prompts that explicitly request that LLMs omit their reasoning could result in lower-quality outputs.
Additionally, during the same session, LLMs consider previous inputs and outputs when generating responses [7]. Given their text-completion ‘urge’, LLMs tend to mimic the format of prior outputs. This property may lower user workload, as guiding prompts need only be typed once. Beginning a new session is prudent for unrelated conversations, to reduce bias from prior outputs retained in the session [5, 11].
It is also an emerging practice to tell LLMs, “If you are not sure about the answer, say you don’t know”. These uncertainty prompts may reduce, but not prevent, inaccurate outputs in certain tasks [12].
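A minimal helper that appends this uncertainty instruction to any prompt (a sketch; the clause wording follows the quoted practice):

```python
UNCERTAINTY_CLAUSE = "If you are not sure about the answer, say you don't know."

def with_uncertainty_guard(prompt):
    # Appending the clause may reduce, but will not prevent, inaccurate outputs
    return f"{prompt}\n\n{UNCERTAINTY_CLAUSE}"
```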
Completing tasks
As LLMs are effective information summarizers, they may streamline the non-creative parts of ophthalmological writing, such as abstract creation [13, 14] or medical letter generation [15]. However, ChatGPT may omit or hallucinate information [14, 15], so human oversight is required [15]. Additionally, the creative components of medical writing (for example, drawing conclusions from data) are most accurately performed by a credible human [13].
As these models are pre-trained to complete text, LLMs occasionally fail to summarize or edit incomplete documents; the ‘urge’ to complete the inputted text takes over instead. In these situations, it is best to supply completed text, or to indicate clearly that the provided text is a fragment.
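One common mitigation, assumed here as general prompting practice rather than drawn from the article, is to delimit the input so the model treats it as a self-contained document rather than text to continue (the tag names and wording are illustrative):

```python
def delimited_task_prompt(instruction, document):
    # Wrapping the document in explicit tags marks it as a self-contained
    # input, discouraging the model's text-completion 'urge'
    return (f"{instruction}\n\n<document>\n{document}\n</document>\n"
            f"Work only on the text between the document tags.")

p = delimited_task_prompt("Summarize this clinic note in two sentences.",
                          "Patient reviewed post-op day 1 ...")
```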
Powerful prompts
Powerful prompts guide LLMs in higher-order tasks, such as requesting more information for a conclusion. An LLM employed to diagnose ophthalmological conditions could be asked: “Patient X is [known information]. Keep asking me questions about patient X until you have sufficient information to arrive at a diagnosis.” This method can minimize uncertainty and reduce the generality of responses provided by an LLM.
Another powerful prompt pattern is asking LLMs for feedback. For example, it is possible to ask LLMs for feedback on a patient handout and then request the LLM implement such feedback, as shown in Box 2. This process can be iterated multiple times.
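The feedback-then-implement cycle can be sketched as a loop. Here `query_llm` is a hypothetical stand-in for any chatbot or API call (prompt in, text out), not a real library function:

```python
def iterative_refinement(draft, query_llm, rounds=2):
    """Alternate between requesting feedback and requesting a rewrite.

    `query_llm` is a hypothetical callable (prompt -> response); substitute
    whichever LLM interface is available. Human review of each revision
    remains essential.
    """
    text = draft
    for _ in range(rounds):
        feedback = query_llm(
            "Give specific feedback on this patient handout:\n" + text)
        text = query_llm(
            "Rewrite the handout below, implementing this feedback.\n"
            "Feedback:\n" + feedback + "\n\nHandout:\n" + text)
    return text
```

Each round issues two queries (one for feedback, one for the rewrite), mirroring the manual process shown in Box 2.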
Lastly, when unsure of how to prompt an LLM, it is possible to ask LLMs like ChatGPT to ask clarifying questions about a desired task until it has sufficient information to write its own prompt. The supplementary information includes an example prompt-generating prompt [16].
Online or local
Critically, data provided to online LLMs such as ChatGPT and Bard may be periodically reviewed by humans [17] or used to train models [18]. Because of this insecurity, confidential data, even when deidentified, should never be shared with an online LLM [19]. It is possible to run some LLMs on-device, for example GPT4All (https://gpt4all.io) and Llama-2 (https://ai.meta.com/llama/). However, this may entail an unfavorable trade-off between model capability and speed on consumer-grade systems. Consequently, enterprise-hosted LLMs approved by institutions and ethics committees may be the most appropriate implementation for analyzing confidential information.
What limits LLMs?
While powerful tools, LLMs have well-documented limitations [20]. The most significant are inaccuracies [20], a potential lack of knowledge of specialist fields and, for GPT-3 and GPT-4, of events and advancements after September 2021 [7]. Furthermore, LLM performance may vary over time; ChatGPT’s capability (with both GPT-3 and GPT-4) declined in early 2023 [21], potentially due to increased safety training [10]. In the future, models with fewer safety restrictions trained explicitly for professionals, such as Med-PaLM-2, may give ophthalmologists access to more powerful LLMs.
Conclusion
The use of LLMs in ophthalmology is emerging and largely untested. Understanding an LLM as a powerful but fallible predictive text algorithm may guide optimal use of this technology. Deliberately choosing powerful prompts, such as the forms suggested above, will yield the highest-quality applications of LLMs in ophthalmology.
References
1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 6000–10.
2. Kleinig O, Gao C, Bacchi S. This too shall pass: the performance of ChatGPT-3.5, ChatGPT-4 and New Bing in an Australian medical licensing examination. Med J Aust. 2023. https://doi.org/10.5694/mja2.52061.
3. Edwards B. AI-powered Bing Chat spills its secrets via prompt injection attack [Updated]. Ars Technica. 2023 [cited 2023 Jul 23]. https://arstechnica.com/information-technology/2023/02/ai-powered-bing-chat-spills-its-secrets-via-prompt-injection-attack/.
4. Raimondi R, Tzoumas N, Salisbury T, Di Simplicio S, Romano MR. Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams. Eye. 2023;9:1–4.
5. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3:100324.
6. Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of generative large language models on ophthalmology board–style questions. Am J Ophthalmol. 2023;254:141–9.
7. OpenAI. GPT-4 technical report. arXiv [Preprint]. 2023 [cited 2023 Jul 23]. http://arxiv.org/abs/2303.08774.
8. Reeves B, Sarsa S, Prather J, Denny P, Becker BA, Hellas A, et al. Evaluating the performance of code generation models for solving Parsons problems with small prompt variations. In: Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V 1. New York, NY, USA: Association for Computing Machinery; 2023. p. 299–305. https://doi.org/10.1145/3587102.3588805.
9. Ekin S. Prompt engineering for ChatGPT: a quick guide to techniques, tips, and best practices. TechRxiv [Preprint]. 2023 [cited 2023 Jul 23]. https://www.techrxiv.org/articles/preprint/Prompt_Engineering_For_ChatGPT_A_Quick_Guide_To_Techniques_Tips_And_Best_Practices/22683919/2.
10. Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, et al. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv [Preprint]. 2023 [cited 2023 Jul 23]. http://arxiv.org/abs/2303.12712.
11. Fijačko N, Gosak L, Štiglic G, Picard CT, Douma MJ. Can ChatGPT pass the life support exams without entering the American Heart Association Course? Resuscitation. 2023;185. https://www.resuscitationjournal.com/article/S0300-9572(23)00045-X/fulltext.
12. Zhu D, Chen J, Haydarov K, Shen X, Zhang W, Elhoseiny M. ChatGPT asks, BLIP-2 answers: automatic questioning towards enriched visual descriptions. arXiv [Preprint]. 2023 [cited 2023 Sep 1]. http://arxiv.org/abs/2303.06594.
13. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023;307:e230171.
14. Babl FE, Babl MP. Generative artificial intelligence: can ChatGPT write a quality abstract? Emerg Med Australas. 2023. https://onlinelibrary.wiley.com/doi/abs/10.1111/1742-6723.14233.
15. Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. Using ChatGPT to write patient clinic letters. Lancet Digit Health. 2023;5:e179–81.
16. Littlefield B. PromptGenerator. ChatGPT users - Skool. 2023. https://www.skool.com/chatgpt/promptgenerator?p=1e5ede93.
17. Google. Bard Privacy Help Hub. https://support.google.com/bard/answer/13594961?hl=en#human_review.
18. OpenAI. Privacy policy. https://openai.com/policies/privacy-policy.
19. Attowooll J. ‘Extremely unwise’: warning over use of ChatGPT for medical notes. NewsGP. 2023. https://www1.racgp.org.au/newsgp/clinical/extremely-unwise-warning-over-use-of-chatgpt-for-m.
20. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023;11:887.
21. Chen L, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time? arXiv [Preprint]. 2023 [cited 2023 Jul 23]. http://arxiv.org/abs/2307.09009.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Contributions
All authors contributed to the conception of the work and drafted and approved the manuscript. Author OK also performed the major literature review and completed the first manuscript draft. Authors SB and WOC were the principal supervisors of this piece and provided major editorial guidance.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Kleinig, O., Gao, C., Kovoor, J.G. et al. How to use large language models in ophthalmology: from prompt engineering to protecting confidentiality. Eye 38, 649–653 (2024). https://doi.org/10.1038/s41433-023-02772-w