Introduction

Large language models (LLMs) are complex predictive text algorithms that generate sentences and paragraphs from a prior sequence of words [1]. With appropriate training, they can act conversationally by generating text that follows a user-provided instruction or question. Well-known modern LLMs include OpenAI’s GPT-3 and GPT-4, Google’s PaLM-2, and Meta’s Llama-2. These models vary in design and performance. Information on accessing these models is included in the supplementary information.

Each implementation of the same LLM (e.g. GPT-4) may behave differently owing to differences in ‘fine-tuning’ training, plugin access [2], and initial instructions [3]. Early evidence suggests that, when answering ophthalmology examination questions, GPT-4 accessed via the ChatGPT or New Bing chatbots is the most performant [2, 4,5,6]. Understanding how to employ LLMs effectively may have significant utility for clinical education and, in the future, clinical practice.

What can LLMs do?

LLMs may be effective in language tasks such as writing, editing, and summarizing. For querying, LLMs excel at questions that are not easily web-searchable and at fact-finding when searchable keywords are lacking. LLMs can also be asked to write code in R and other languages. However, LLMs often perform mathematical operations inaccurately [7]. Plugins such as embedded calculators (Supplementary Information) can address specific LLM limitations.

Engaging LLMs

LLMs are engaged through text input. Slight variations in input text can cause significant variations in output [8]. Consequently, high-quality prompting, or ‘prompt engineering’, is critical to optimizing LLM outputs.

Asking the right question

It is widely accepted that providing specific instructions to LLMs yields more useful outputs [9]. For example, requesting a summary of a condition’s epidemiology, pathophysiology, diagnosis, and treatment is generally superior to asking for general information (Box 1). Additionally, using specifiers such as “Provide a brief/detailed summary of”, “Explain at a high-school/college-grade/consultant level”, and “Present your findings in a table/as dot points” sets specific parameters to allow LLMs to generate more useful outputs.
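As an illustration only, the following minimal sketch shows how such a structured request might be sent programmatically. It assumes access to OpenAI’s Python SDK and a model named “gpt-4”; the condition and prompt wording are merely one possible form of the specifiers described above.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# A specific, structured request generally outperforms "Tell me about keratoconus".
prompt = (
    "Provide a brief summary of keratoconus, covering its epidemiology, "
    "pathophysiology, diagnosis, and treatment. "
    "Explain at a college-grade level and present your findings as dot points."
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name; substitute whichever model you can access
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```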

Preparing the LLM

‘Priming’ LLMs for expected inputs and responses is an emerging practice [2]. For example: “I am going to provide X; I want you to output Y”. This strategy increases the chance of a useful output. Example outputs can also be provided for the LLM to mimic in style. Online libraries of example ‘guiding prompts’ are emerging (for example https://www.learnprompt.org/act-as-chat-gpt-prompts/).
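The hypothetical sketch below, under the same assumed SDK, shows one way a priming instruction and an example output for style mimicry could be supplied before the actual input; the referral-summary scenario and example text are illustrative, not prescriptive.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Prime the model with the expected input (X) and desired output (Y),
# and supply a worked example for it to mimic in style.
priming = (
    "I am going to provide a referral letter; I want you to output a "
    "three-sentence plain-language summary for the patient."
)
example = (
    "Example output:\n"
    "Your optometrist has noticed a change at the back of your eye. "
    "We would like to see you in the clinic to take some photographs. "
    "This appointment is routine and no treatment is planned yet."
)
referral_text = "..."  # the actual referral letter would be pasted here

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[
        {"role": "system", "content": priming + "\n\n" + example},
        {"role": "user", "content": referral_text},
    ],
)
print(response.choices[0].message.content)
```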

Notably, certain instructions can worsen output quality. ChatGPT’s outputs are most accurate when it explains its reasoning before answering [10]. Therefore, prompts that explicitly ask an LLM to omit its reasoning may result in lower-quality outputs.

Additionally, within the same session, LLMs consider previous inputs and outputs when responding [7]. Given their text-completion ‘urge’, LLMs tend to mimic the format of prior outputs. This property may lower user workload, as guiding prompts need only be typed once. For unrelated conversations, beginning a new session is prudent to reduce bias from retained prior inputs and outputs [5, 11].

It is also an emerging practice to tell LLMs, “If you are not sure about the answer, say you don’t know”. These uncertainty prompts may reduce, but not prevent, inaccurate outputs in certain tasks [12].
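The following sketch, again assuming OpenAI’s Python SDK, illustrates how a session “remembers”: prior turns are simply resent with each request (chatbot interfaces such as ChatGPT do this automatically), and the reasoning instruction and uncertainty prompt need only be given once at the start of the session. The questions are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Within a session, prior turns are resent with each request, so the model
# "remembers" earlier inputs and tends to mimic the format of earlier outputs.
history = [
    {
        "role": "system",
        "content": (
            "Answer questions about ophthalmic medications as dot points. "
            "Explain your reasoning before giving the answer. "
            "If you are not sure about the answer, say you don't know."
        ),
    }
]

for question in [
    "What are the common side effects of topical latanoprost?",
    "And of topical timolol?",  # relies on the context retained in the session
]:
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # retain for the next turn
    print(answer)
```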

Completing tasks

As LLMs are effective information summarizers, they may streamline the non-creative parts of ophthalmological writing, such as abstract creation [13, 14] or medical letter generation [15]. However, ChatGPT may omit or hallucinate information [14, 15], so human oversight is required [15]. Additionally, the creative components of medical writing (for example, drawing conclusions from data) are most accurately performed by a credible human [13].

As these models are pre-trained to complete text, LLMs occasionally fail to summarize or edit incomplete documents: their ‘urge’ is to continue the inputted text instead. In these situations, it is best either to supply completed text or to mark the fragment clearly as such, as sketched below.
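One practical workaround, sketched below under the same SDK assumption, is to delimit the fragment explicitly and instruct the model to edit rather than continue it; the triple-quote delimiter is an arbitrary illustrative convention.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

fragment = "..."  # an unfinished draft paragraph would be pasted here

# Clearly mark the text as a fragment so the model edits it
# rather than obeying its 'urge' to continue it.
prompt = (
    "The text between the triple quotes is an unfinished draft. "
    "Do not continue or complete it; only correct grammar, improve clarity, "
    "and return the edited fragment.\n"
    f'"""{fragment}"""'
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```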

Powerful prompts

Powerful prompts guide LLMs in higher-order tasks, such as requesting more information for a conclusion. An LLM employed to diagnose ophthalmological conditions could be asked: “Patient X is [known information]. Keep asking me questions about patient X until you have sufficient information to arrive at a diagnosis.” This method can minimize uncertainty and reduce the generality of responses provided by an LLM.
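A minimal interactive sketch of this prompt pattern follows, assuming OpenAI’s Python SDK; the vignette is hypothetical, the number of rounds is capped arbitrarily, and nothing here constitutes a validated diagnostic tool.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

messages = [
    {
        "role": "user",
        "content": (
            # Hypothetical vignette for illustration only.
            "Patient X is a 68-year-old with sudden painless vision loss in one eye. "
            "Keep asking me questions about patient X, one at a time, until you have "
            "sufficient information to arrive at a diagnosis, then state it."
        ),
    }
]

for _ in range(5):  # arbitrary cap on question-and-answer rounds
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    question = reply.choices[0].message.content
    print(question)
    messages.append({"role": "assistant", "content": question})
    messages.append({"role": "user", "content": input("Your answer: ")})
```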

Another powerful prompt pattern is asking LLMs for feedback. For example, it is possible to ask an LLM for feedback on a patient handout and then request that it implement that feedback, as shown in Box 2. This process can be iterated multiple times.
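The two-step sketch below, under the same SDK assumption, mirrors the Box 2 workflow programmatically: feedback is requested first, and the model is then asked to implement its own feedback. The handout topic is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

handout = "..."  # the draft patient handout would be pasted here

# Step 1: ask for feedback on the draft.
messages = [
    {
        "role": "user",
        "content": "Give specific feedback on the readability and accuracy "
                   "of this patient handout about cataract surgery:\n" + handout,
    }
]
feedback = client.chat.completions.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant", "content": feedback.choices[0].message.content})

# Step 2: ask the model to implement its own feedback;
# this pair of steps can be iterated multiple times.
messages.append({"role": "user", "content": "Now rewrite the handout, implementing that feedback."})
revised = client.chat.completions.create(model="gpt-4", messages=messages)
print(revised.choices[0].message.content)
```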

Lastly, when unsure how to prompt an LLM, it is possible to ask LLMs like ChatGPT to ask clarifying questions about a desired task until they have sufficient information to write their own prompt. The supplementary information includes an example prompt-generating prompt [16].

Online or local

Critically, data provided to online LLMs such as ChatGPT and Bard may be periodically reviewed by humans [17] or used to train models [18]. Because of this lack of data security, confidential data, even when deidentified, should never be shared with an online LLM [19]. Some LLMs can be run on-device, for example GPT4ALL (https://gpt4all.io) and Llama-2 (https://ai.meta.com/llama/), although this may involve an unfavorable trade-off between model capability and speed on consumer-grade systems. Consequently, enterprise-hosted LLMs approved by institutions and ethics committees may be the most appropriate implementation for analyzing confidential information.
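As a rough sketch of on-device use, the snippet below assumes the GPT4ALL project’s Python bindings (the gpt4all package) and a locally stored model file; the model file name is a placeholder, and capability and speed will depend heavily on local hardware.

```python
# A minimal sketch of on-device inference using the gpt4all Python bindings
# (pip install gpt4all); no text leaves the local machine.
from gpt4all import GPT4All

# Placeholder model file name; replace with a model available to you.
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")

with model.chat_session():
    reply = model.generate(
        "Summarize the key counselling points for a patient starting "
        "topical glaucoma drops, as dot points.",
        max_tokens=300,
    )
    print(reply)
```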

What limits LLMs?

While powerful tools, LLMs have well-documented limitations [20]. The most significant are inaccuracies [20], a potential lack of knowledge of specialist fields and, for GPT-3 and GPT-4, of events and advancements after September 2021 [7]. Furthermore, LLM performance may vary over time; ChatGPT’s capability (with GPT-3 and GPT-4) declined in early 2023 [21], potentially due to increased safety training [10]. In the future, models with less restrictive safety training that are explicitly designed for professionals, such as Med-PaLM-2, may allow ophthalmologists to access more powerful LLMs.

Conclusion

The use of LLMs in ophthalmology is emerging and largely untested. Understanding an LLM as a powerful but fallible predictive text algorithm may guide optimal use of this technology. When using LLMs, deliberately choosing powerful prompts, such as the forms suggested above, will yield the highest-quality outputs in ophthalmology.