Introduction

Based on data provided by the World Health Organization, traumatic brain injury (TBI) is the third leading cause of death globally1, accounting for nearly half of all injury-related deaths worldwide1,2. Moreover, TBI is a major cause of acquired disability worldwide; however, effective treatment methods are scarce3. Brain trauma can lead to head injuries, skull fractures, brain tissue damage, and, in severe cases, coma, memory loss, and cognitive impairment. Owing to the limited regenerative capacity of the nervous system, the rehabilitation of patients with brain trauma is a lengthy process3.

Recently, the use of artificial intelligence (AI) to provide personalized medical services for clinical brain rehabilitation has gained significant attention4. AI offers the advantage of providing prompt diagnostic and therapeutic recommendations for brain rehabilitation. An emerging area of research is the use of large language models (LLMs) as tools for rehabilitation support, which has gained traction in a variety of fields, including chronic pulmonary disease5, rehabilitation education6, and physical and skeletal rehabilitation7,8. Despite these advancements, LLMs have limitations, including issues with accuracy and comprehensiveness9,10. LLMs may also generate "hallucinations"11,12, making them unsuitable for providing professional medical advice. Moreover, the lack of explainability13,14 of their outputs makes it difficult for doctors and patients to establish trust when interacting with a "robotic system".

In the field of GPT technology, the use of agents is considered the latest approach for tackling complex problems15,16. This approach has demonstrated exceptional performance in fields such as programming17, gaming18, and even complex computer tasks19. However, the application of agent technology in the medical field remains in its nascent stage. This study therefore aimed to explore an agent technology based on medical guidelines that can respond to user inputs. Simultaneously, relevant content from the medical guidelines was output within the responses to enhance the explainability of the results.

This study comprised a comparative analysis of the responses of GPT-4 used directly and of GPT-agents constructed based on guidelines. A set of brain rehabilitation questions was selected from a doctor-patient Q&A database for assessment. The primary endpoint was the better answer, whereas the secondary endpoints included accuracy, completeness, explainability, and empathy.

Results

ChatGPT-agents question-answering system

Thirty random questions (Supplementary Table 2) were answered, and GPT-agents took significantly longer to respond than GPT-4 (54.05 vs. 9.66 s per question). The "Results Evaluation and Source Citation" agent had the longest response time (Table 1, Fig. 1). Regarding word count, GPT-4 answered with an average of 57 words, significantly fewer than the average of 371 words for GPT-agents (Fig. 2).

Table 1 The agents and their functions.
Figure 1

The flowchart of the GPT process. (A) The GPT-agents based on medical guidelines (group GPT-agents); (B) the direct use of GPT-4 (group GPT-4).

Figure 2

Time consumption and word count. (A, B) Time consumption: the GPT-agents required more time to answer questions, with an average response time of 54.05 s, whereas GPT-4 (direct) took 9.66 s. (C, D) Word count: the GPT-agents generated more words, with an average count of 371, whereas GPT-4 produced fewer, with an average of 57. The "Results Evaluation and Source Citation" agent accounted for the most time and the highest word count.

Evaluation results

Three evaluators assessed the responses to 30 random questions (Supplementary Table 2). Based on the evaluation results, GPT-agents provided the superior answer in most cases (n = 20, 66.7%) compared with GPT-4 (n = 10, 33.3%). Chi-square analysis revealed that GPT-agents significantly outperformed the GPT-4 group (χ2 = 6.667, p = 0.0098). Further analysis of the accuracy evaluation revealed that the guideline-based GPT-agents (3.8 ± 1.02) outperformed GPT-4 (3.2 ± 0.96, p = 0.0234). The completeness evaluation showed that both models gave incomplete answers, with no significant difference (2 ± 0.87 vs. 1.7 ± 0.79, p = 0.213). In terms of explainability and empathy, however, the GPT-agents performed significantly better than GPT-4 (Table 2, Fig. 3).

Table 2 The evaluation results.
Figure 3

Evaluation of the answers. (A) GPT-agents provided the better answer for 66.7% of the questions. (B) Accuracy evaluation: GPT-agents demonstrated higher accuracy than direct GPT-4. (C) Completeness evaluation: both models showed incompleteness. (D) Explainability evaluation: GPT-agents exhibited significantly better results than GPT-4. (E) Empathy evaluation: GPT-agents showed higher empathy than GPT-4.

In the response analysis, when faced with information not covered in the guidelines, the GPT-agents explicitly indicated "unclear" instead of fabricating content that conflicted with the guidelines (Supplementary Table 2). In the evaluation section of the results, the GPT-agents explicitly indicated whether the answers were correct and cited the specific content from the guidelines (Table 3).

Table 3 Selected questions and answers.

Discussion

In this study, medical guidelines20 and agents based on GPT-4 were used to answer questions related to TBI rehabilitation. This system automatically evaluates the correctness of the answers, simultaneously providing relevant content from the medical guidelines to enhance explainability. The evaluation revealed that the responses generated by the guideline-based GPT-agents performed better in terms of accuracy, explainability, and empathy than those obtained by directly querying GPT-4.

Brain rehabilitation is a comprehensive and lengthy treatment process involving a variety of aspects, including physical therapy, speech therapy, cognitive training, and psychological support21,22. LLMs acquire knowledge from various professional disciplines during training, making them highly suitable for assisting with brain rehabilitation.

Currently, LLMs have demonstrated potential in the medical field23,24 owing to their powerful natural language processing and generation capabilities25,26. However, the direct use of LLMs is still limited by certain challenges, such as inaccurate responses or the generation of hallucinations. Agents based on LLMs have shown significant advantages for complex task processing. For example, humans typically perform autonomous programming or automate certain real-world tasks using computers or smartphones. Agents can also be employed in medical tasks, such as dermatological patient-doctor conversations and assessments27. The GPT-agents constructed in this study involved multiple API calls, which produced lengthier answers but also increased the response time. Overall, the GPT-agents had a longer response time than GPT-4 but could still provide answers within an average of 1–2 min, generating outputs of 300–700 words (in Chinese). This speed is acceptable for clinical counseling, as it is much shorter than real-world waiting times in hospitals.

Traditional direct question-answering systems such as ChatGPT have been found to be limited by potential issues related to accuracy28,29 and the generation of hallucinatory responses to medical queries30,31. Medical guidelines and expert consensus thus serve as the cornerstone of clinical practice. GPT-4 has powerful summarization capabilities29, making it a potential tool for guideline classification. In the present study, we observed that after inputting guideline information into GPT-4, its medical role was significantly activated, leading to improved response accuracy. We further found that the inclusion of guidelines did not directly restrict the agents' responses. Overall, our GPT-agents could provide suggestions during result evaluation, which offers an alternative when no answer is available based on the guidelines.

Several previous studies have attempted to improve the accuracy and completeness of LLMs through prompt engineering, fine-tuning, and retraining29,32. Considering the high cost of fine-tuning and retraining, this study focused instead on prompt engineering techniques. By utilizing guideline-based agents to process the guidelines and input them as prompts to the GPT, the accuracy of the agents' responses improved significantly. This improvement could be attributed to the use of medical guidelines as prompts, which better set the context and cultural positioning of the GPT. Guidelines are commonly modified to suit the specific healthcare environments of a particular region; thus, different healthcare environments and conditions may implement slightly different approaches for the same medical issue. For example, Traditional Chinese Medicine is often incorporated into medical guidelines and consensus in China20. This study followed a logical chain of thinking, incorporating knowledge from medical guidelines, and employed multiple evaluative agents to assess the questions and answers. We believe that providing professional medical guidelines and utilizing evaluative agents are superior strategies for enhancing response quality.

Completeness reflects the accumulation of experience in long-term clinical work, involving insights and reflections on multiple dimensions of illness. In the present study, we found that both the GPT-agents and GPT-4 were lacking in terms of completeness, indicating that their ability to answer medical questions is still at an early stage of development. Further research should explore whether incorporating fine-tuning techniques can improve completeness.

Explainability is an important criterion when evaluating the current use of AI in medicine14,33. Because of their large number of parameters, LLMs are inherently difficult to explain. In the present study, the explainability of the results was assessed by referencing the original text of the guidelines. After the answer was evaluated as “correct” or “incorrect”, the related original text of the referenced guideline content was output by the final agent. This significantly increases the explanatory power of the results.

Patients with brain injury often require a lengthy recovery period and rely on their families for reintegration into society. Empathy can help family members understand and motivate patients, thus boosting their confidence in treatment. GPT-4 itself appears to have an advantage over clinical doctors in terms of empathy34,35. In the present study, we found that the GPT-agents showed significantly enhanced empathy compared with the base GPT-4. This may be attributed to the inclusion of more medical information, which gave the GPT more precise positioning and allowed it to generate words associated with empathy.

Although this study found that GPT-agents based on medical guidelines could significantly improve medical responses, some limitations should be considered. First, the use of GPT-agents increases the time cost. Overall, we found an average increase of 1 min in response time for the GPT-agents in our study; however, this may vary with geographic region and Internet environment. Second, there is the issue of incomplete answers. Clinical practice is complex and involves multiple disciplines, and no single guideline can adequately address every complex clinical issue. Guidelines are constantly evolving and may not always align with the most advanced treatment approaches; as such, they must be critically evaluated. Incorporating a broad, non-redundant summary of guidelines could help overcome this problem. Third, this study did not employ random double-blinding: because guideline references were included in the GPT-agents' responses, it was impossible to blind the assessors, which could have introduced subjectivity into the results. Finally, the actual medical environments in hospitals are complex and variable, involving individual patient situations, medical histories, and symptoms. Additionally, ethical and medical regulations differ across regions. ChatGPT may not have fully considered these factors when answering questions, thus limiting the applicability of its responses. As such, when using the GPT, healthcare professionals and clinical teams must maintain professional judgment, integrate GPT responses with specific patient contexts, and develop the best diagnosis and treatment plans accordingly.

In future research, optimization could continue through several approaches. First, it will be necessary to further refine the foundational large models, particularly by upgrading them to multimodal models. This is crucial, as many patients with clinical brain injury may not be able to complete typing or speaking tasks; utilizing various input modes (such as voice and images) can help broaden accessibility. Second, further studies should explore whether agents based on medical guidelines exhibit common patterns in other conditions, such as rare diseases or critical illnesses. It is essential to determine whether employing guideline-based agents can enhance the responses of LLMs. Finally, as various diseases and medical guidelines intersect, research on recommendation algorithms will be necessary. Such an algorithm should accurately assess and rank diverse search content and discern patients' true intentions, as different diseases involve varying guidelines and a single condition may have multiple treatment guidelines.

Despite these limitations, our research showed that GPT-agents that rely on medical guidelines hold significant promise for various medical applications. By integrating evidence-based guidelines, these agents can utilize the wealth of knowledge and expertise accumulated through extensive clinical practice and research. This integration not only improves the reliability of the generated responses, but also ensures their alignment with established medical standards and best practices.

Overall, the results of this study showed that GPT-agents enhanced the accuracy and empathy of responses to TBI rehabilitation questions. This study provides guideline references and demonstrates improved clinical explainability. Compared with the direct use of GPT-4, GPT-agents based on medical guidelines showed improved performance, despite a slight increase in response time. With advances in technology, this delay is expected to be minimized; however, further validation through multicenter trials in a clinical setting is necessary. In summary, this study offers practical insights and establishes the groundwork for the potential integration of LLM-agents in the field of medicine.

Methods

This study employed a cross-sectional, non-human-subject research design. A flowchart of the study design is shown in Fig. 1. As this study did not involve human or animal participants and used only the OpenAI API, which can be freely accessed from Kaggle.com, ethical committee approval was not required.

Several LLMs are currently available; online models include Google's Bard, Microsoft's Bing, Baidu's Wenxin Yiyan, IFLYTEK's Spark, and OpenAI's GPT series, among others. Offline deployable options include LLaMA and ChatGLM. Given the popularity of GPT-4 among our research team, GPT-4 was chosen as the foundational model.

In the present study, multiple agents were constructed using GPT-4, including "Medical Guideline Classification", "Question Retrieval", "Matching Evaluation", "Intelligent Question-Answering", and "Results Evaluation and Source Citation" (Fig. 1). The knowledge for the agents was derived from Chinese expert consensus and guidelines on brain injury rehabilitation.

Design of guideline-based ChatGPT-agents (GPT-agents)

Guideline-based GPT-agents were designed based on GPT-4. The primary objective of an intelligent agent is to retrieve and provide word suggestions as answers. An evaluation was introduced for each of the steps mentioned above, resulting in five intelligent agents (Table 1). The first agent was responsible for the clustering analysis of the guidelines, extracting the topics and subtopics of each section and saving all extracted topics for later reference and retrieval. The second agent searched the inputted question within the subtopics; its output was the question plus the related guideline content identified by the first agent. The third agent performed a "Matching Evaluation" to check whether the question and the content were relevant. The fourth agent was the question-answering agent, which synchronously input the user's question and the corresponding topic-related content into the GPT-4 model to generate the answer. Finally, the fifth agent performed two functions: first, it evaluated the accuracy of the generated answer by comparing it with the contents of the guidelines; second, it produced the final response along with the relevant guideline content corresponding to that response (Fig. 1A).
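To make the pipeline concrete, a minimal sketch of how such a five-agent chain could be wired to the OpenAI Python client is shown below. This is an illustration under stated assumptions, not the authors' released code: the helper names (ask_gpt4, answer_with_agents), the prompts, and the subtopic_index mapping produced offline by the first agent are all hypothetical.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def ask_gpt4(system: str, user: str) -> str:
    """One GPT-4 chat call, shared by all agents."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}])
    return resp.choices[0].message.content

def answer_with_agents(question: str, subtopic_index: dict) -> str:
    # subtopic_index maps subtopic titles to guideline excerpts; it is the
    # saved output of the first ("Medical Guideline Classification") agent.
    # Agent 2: retrieve the guideline subtopic most relevant to the question.
    subtopic = ask_gpt4(
        "Return only the single most relevant subtopic title.",
        f"Question: {question}\nSubtopics:\n" + "\n".join(subtopic_index))
    excerpt = subtopic_index.get(subtopic.strip(), "")

    # Agent 3: matching evaluation, a True/False relevance check.
    match = ask_gpt4(
        "Answer True or False only.",
        f"Is this excerpt relevant to the question?\n"
        f"Question: {question}\nExcerpt: {excerpt}")
    if not match.strip().lower().startswith("true"):
        excerpt = ""  # answer without guideline context if nothing matches

    # Agent 4: intelligent question-answering with the excerpt as context.
    answer = ask_gpt4(
        "You are a TBI rehabilitation assistant. If the context does not "
        "cover the question, say 'unclear' rather than guessing.",
        f"Context: {excerpt}\nQuestion: {question}")

    # Agent 5: results evaluation and source citation.
    return ask_gpt4(
        "Judge whether the answer is correct against the excerpt, then "
        "output the final response followed by the cited excerpt.",
        f"Excerpt: {excerpt}\nQuestion: {question}\nAnswer: {answer}")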

The program was deployed on the Kaggle platform (Kaggle.com), and OpenAI's GPT-4 API was utilized for automated question answering. The program automatically recorded the number of words generated and the time consumed. The first agent, responsible for categorization, only ran once and did not participate in the answer-generation process; therefore, time and word counts were not recorded for this agent. For the second and third agents, whose results mainly involved returning candidate content from the guidelines and "True/False" answers, word counts were likewise not recorded.
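This bookkeeping can be reproduced with the standard library alone; a sketch follows, in which the record fields are illustrative and the character count len(answer) stands in for the word count of Chinese text.

import time

def timed_call(question, answer_fn):
    """Run one question and record the elapsed time and output length."""
    start = time.perf_counter()
    answer = answer_fn(question)
    elapsed = time.perf_counter() - start
    # For Chinese output, len() counts characters, which approximates
    # the per-answer word count reported in the study.
    return {"question": question, "answer": answer,
            "seconds": round(elapsed, 2), "words": len(answer)}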

The direct-GPT (GPT-4)

The direct question-and-answer design was based on GPT-4, utilizing the same environment as the GPT-agents. Within this design, all questions were posed within a "for" loop (as in the GPT-agents group), and GPT-4 directly generated the responses (Fig. 1B). The process recorded all content, including the time consumed and the word count of the generated answers.
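Reusing the hypothetical helpers sketched above, the direct arm reduces to a plain "for" loop over the same questions; again, this is an illustrative sketch rather than the study code.

records_gpt4 = []
# sampled_questions: the 30 randomly selected questions
# (see "Question data collection" below)
for question in sampled_questions:
    record = timed_call(
        question,
        lambda q: ask_gpt4("You are a TBI rehabilitation assistant.", q))
    records_gpt4.append(record)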

The medical guidelines

The references for TBI rehabilitation guidelines were obtained by searching a specialized Chinese database that collects clinical guidelines and expert consensuses (Clinical Guidelines Network, https://guide.medlive.cn). Brain rehabilitation guidelines and standards were retrieved and thoroughly reviewed by a clinician (L.Z.Z.) with 14 years of clinical experience. After clinical evaluation, the expert consensus20 that best aligned with Chinese TBI rehabilitation was incorporated into the system, making it more comprehensive and inclusive of content from traditional Chinese medicine.

Question data collection

First, 300 real-world brain rehabilitation-related questions from doctor-patient interactions were collected from online sources. Two medical experts (L.Z.Z. and Z.W.), both with over 10 years of clinical experience at the same Grade A tertiary hospital, manually collected the 300 Chinese brain injury rehabilitation-related questions from two open-source Chinese medical dialogue datasets (https://github.com/Toyhom/Chinese-medical-dialogue-data, datasets/FreedomIntelligence/huatuo_knowledge_graph_qa) and one website (https://youlai.cn/). Each question is accompanied by an answer, and the responses to these questions are publicly available. The questions cover the various stages of brain injury rehabilitation. Second, we randomly selected 30 questions to ask and evaluate using a computational method (code: random.sample(questions, 30)).
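Note that Python's random.choice draws only a single element; selecting 30 distinct questions corresponds to random.sample, as in this brief sketch (the file name and seed are hypothetical):

import random

random.seed(42)  # illustrative seed; the original seed was not reported
with open("questions.txt", encoding="utf-8") as f:  # hypothetical file
    questions = [line.strip() for line in f if line.strip()]

# random.sample draws 30 distinct items without replacement, whereas
# random.choice(questions) would return only a single question.
sampled_questions = random.sample(questions, 30)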

The inclusion criteria were as follows: (1) questions related to brain rehabilitation; (2) answers by medical experts available; (3) publicly available question-and-answer pairs not involving personal privacy; and (4) no copyright restrictions. The exclusion criteria were as follows: (1) inadequate responses prompting further hospital visits; (2) questions focusing on severe complications of vital organs such as the heart or kidneys; (3) questions unanswered by doctors; and (4) questions or answers violating medical ethics or Chinese laws.

Evaluation for GPT-agents and GPT-4

The evaluation team members included a chief physician (Z.J.F.), a senior physician (L.Z.Z.), and a nurse (X.R.Y.), all of whom had more than 10 years' experience in clinical practice. The primary endpoint was the better answer, whereas the secondary endpoints included accuracy, completeness, explainability, and empathy.

First, the evaluators judged which of the two answers (GPT-4 or GPT-agents) was better. Next, we evaluated the four sub-dimensions of accuracy, completeness, explainability, and empathy separately.

We developed a Likert scale to score the responses. For the accuracy dimension, we referenced previous studies36 and adopted a continuous 5–0 rating system; the other dimensions were evaluated using a continuous 3–0 scale. A higher score signified strong agreement, whereas a score of 0 indicated strong disagreement (Supplementary Table 1).

Statistical analysis

Categorical data for the primary endpoint are presented as numbers of cases and their respective rates, and comparisons between groups were performed using the chi-square or Fisher's exact test. Normally distributed measurement data are presented as means ± standard deviations, and comparisons between groups were conducted using two independent-sample t-tests. Skewed measurement data are presented as medians and interquartile ranges. The level of statistical significance was set at p < 0.05. All statistical analyses were performed using GraphPad software (version 8). The time consumed and word counts were plotted using Matplotlib in Python 3.10.
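For readers who prefer scripting the analysis, the same comparisons can be sketched in Python with scipy.stats. The 2 x 2 table below is one reading of the preference counts that reproduces the reported χ2 = 6.667; the accuracy-score arrays are simulated placeholders, not the study data.

import numpy as np
from scipy import stats

# Primary endpoint: "better answer" preference counts (20 vs. 10 of 30),
# arranged as a 2 x 2 table; without Yates' correction this yields the
# reported chi-square of 6.667 (p = 0.0098).
chi2, p_chi2, dof, expected = stats.chi2_contingency(
    [[20, 10], [10, 20]], correction=False)

# Secondary endpoint example: two independent-sample t-test on accuracy
# scores; the arrays are random draws matching the reported means and
# standard deviations (3.8 +/- 1.02 vs. 3.2 +/- 0.96), for illustration only.
rng = np.random.default_rng(0)
agents_accuracy = rng.normal(3.8, 1.02, 30)
gpt4_accuracy = rng.normal(3.2, 0.96, 30)
t_stat, p_t = stats.ttest_ind(agents_accuracy, gpt4_accuracy)

print(f"chi2 = {chi2:.3f}, p = {p_chi2:.4f}; t = {t_stat:.2f}, p = {p_t:.4f}")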