Performance of ChatGPT, human radiologists, and context-aware ChatGPT in identifying AO codes from radiology reports

While radiologists can describe a fracture's morphology and complexity with ease, the translation into classification systems such as the Arbeitsgemeinschaft Osteosynthesefragen (AO) Fracture and Dislocation Classification Compendium is more challenging. We tested the performance of generic chatbots and of chatbots aware of specific knowledge of the AO classification provided via a vector index, and compared it to human readers. In the 100 radiological reports we created based on random AO codes, chatbots provided AO codes significantly faster than humans (mean 3.2 s per case vs. 50 s per case, p < .001), though without reaching human performance (maximum chatbot performance of 86% correct full AO codes vs. 95% in human readers). In general, chatbots based on GPT 4 outperformed those based on GPT 3.5-Turbo. Further, we found that providing specific knowledge substantially enhances a chatbot's performance and consistency: the context-aware chatbot based on GPT 4 provided 71% consistently correct full AO codes, compared to 2% for the generic ChatGPT 4. This provides evidence that refining and providing specific context to ChatGPT will be the next essential step in harnessing its power.

Technical implementation of the chatbots.
Data preparation and indexing. To develop and evaluate the proposed FraCChat chatbots, we relied on the AO/OTA Fracture and Dislocation Classification Compendium-2018 as a foundational knowledge base.
We employed LlamaIndex as an interface between external data and GPT 3.5-Turbo or GPT 4 (https://github.com/jerryjliu/llama_index, Version 0.5.0). Text information was extracted from the classification compendium and an index was created using the GPTVectorIndex function of LlamaIndex. For this, the document text was divided into smaller chunks of up to 512 tokens (a measure of text content) and converted to data nodes. These nodes were then encoded using an embedding model (text-embedding-ada-002-v2 by OpenAI) and the result was stored in a dictionary-like structure.
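The chunk-embed-store pipeline described above can be sketched in plain Python. This is an illustrative stand-in, not the authors' code: the toy `embed` function replaces the OpenAI text-embedding-ada-002-v2 call (which requires an API key), and a whitespace split replaces the real tokenizer.

```python
import hashlib
import math

CHUNK_TOKENS = 512  # maximum chunk size used for the index


def tokenize(text):
    # Crude whitespace tokenizer as a stand-in for the model tokenizer
    return text.split()


def chunk_text(text, max_tokens=CHUNK_TOKENS):
    # Split the compendium text into chunks of up to max_tokens tokens
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def embed(text, dim=16):
    # Deterministic toy embedding standing in for text-embedding-ada-002-v2:
    # hash each token into a bucket, then L2-normalize the count vector
    vec = [0.0] * dim
    for tok in tokenize(text.lower()):
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(document_text):
    # Dictionary-like structure: node id -> (chunk text, embedding)
    return {i: (chunk, embed(chunk))
            for i, chunk in enumerate(chunk_text(document_text))}
```

In LlamaIndex itself these steps are performed internally when the vector index is constructed from the loaded compendium documents.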
Prompting strategy and answer synthesis. To customize the output of all chatbots for the case-based scenarios, the question posed to the system in each case was: "Examine the given fracture description and identify the AO Classification up to Subgroups. Begin your response always with 'Full AO Code:'". The direct output of GPT 3.5-Turbo and GPT 4 was captured as the response. For our context-based chatbots, a multistep answer creation and refinement method was used. As a first step, the case description was converted to an embedding using the same embedding model (text-embedding-ada-002-v2 by OpenAI), and the five best-matching data nodes were retrieved from the index. These nodes were then used in the answer creation step: the task, the case and the retrieved references were presented as new input to GPT 3.5-Turbo or GPT 4, with an overarching prompt to answer the given task based on the presented information. The final output was then captured.
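The retrieval and prompt-assembly step can be sketched as follows. This is a minimal illustration assuming precomputed node embeddings; the function names and prompt wrapper are illustrative, and only the task string is taken from the study.

```python
import math

TASK = ('Examine the given fracture description and identify the AO '
        'Classification up to Subgroups. Begin your response always '
        'with "Full AO Code:"')


def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)


def retrieve(query_vec, index, k=5):
    # Return the k best-matching data nodes by cosine similarity,
    # where index is a list of (node_text, embedding) pairs
    ranked = sorted(index, key=lambda node: cosine(query_vec, node[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(case_description, references):
    # Overarching prompt: answer the task using the retrieved references
    context = "\n".join(f"- {r}" for r in references)
    return ("Answer the given task based on the presented information.\n"
            f"References:\n{context}\n\n"
            f"Task: {TASK}\n\nCase: {case_description}")
```

The resulting prompt would then be sent to GPT 3.5-Turbo or GPT 4 and the model's completion captured as the chatbot's answer.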
Preparation of case files. The chatbots' accuracy in comparison to human performance was tested in a scenario resembling clinical routine. For this, 100 radiological reports describing fractures were created. From the more than 500 options of the AO classification (neglecting qualifications and universal modifiers), we randomly chose 100. This resulted in 14 cases for the humerus, 1 for the scapula, 4 for the clavicle, 9 for the radius, 4 for the ulna, 9 for the femur, 3 for the patella, 14 for the tibia, 7 for tibia/fibula, 3 for the fibula, 12 for the pelvis, 12 for the hand/carpus and 8 for the foot. Ground-truth AO codes for the case files comprised information on bone, segment and type of fracture in all cases, while information on group was available in 73% and on subgroup in 56%.
The reports were then created by an experienced resident describing the fracture in a manner sufficient for classification.
Assessment of human and chatbot performance. The 100 case files were presented to two radiologists, who independently classified them according to the AO classification. Consultation of guidelines was allowed during this assessment.
We utilized a script-based approach on the 100 case files to perform fivefold repetition testing for all four chatbots. From this, the median performance across the five runs was calculated.
The performance of humans and chatbots was assessed through qualitative analysis of the different levels of the AO classification extracted from the full AO code: (I) full AO code, (II) bone location, (III) part of the bone, (IV) type of fracture, (V) group, and (VI) subgroup. To compare the respective outputs of humans and chatbots, we employed a Generalized Linear Mixed Model (GLMM) with a binomial family (logit link) to investigate the relationship between the dependent variable "rating" and the independent variable "method". The rating method (human, GPT 3.5-Turbo, GPT 4, FraCChat 3.5 and FraCChat 4) was included as a fixed factor, while allowing for random intercepts for "case" and "rater".
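As a hedged sketch, such a model could be set up with the Bayesian binomial mixed model in statsmodels. The study does not state which software was used; this version fits a random intercept per case only (omitting the rater term for brevity) on simulated illustrative data, with the accuracies loosely following the reported proportions.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
methods = ["human", "gpt35", "gpt4", "fracchat35", "fracchat4"]
# Illustrative per-method accuracies (loosely following the reported results)
acc = {"human": 0.95, "gpt35": 0.05, "gpt4": 0.07,
       "fracchat35": 0.57, "fracchat4": 0.83}
rows = [{"case": c, "method": m, "rating": int(rng.random() < acc[m])}
        for c in range(100) for m in methods]
df = pd.DataFrame(rows)

# Binomial GLMM (logit link): "method" as fixed factor,
# random intercepts for "case" via a variance-component formula
model = BinomialBayesMixedGLM.from_formula(
    "rating ~ method", {"case": "0 + C(case)"}, df)
result = model.fit_vb()  # variational Bayes approximation
print(result.summary())
```

The fixed-effect posterior means then quantify how each method's log-odds of a correct rating differs from the reference method, after accounting for case-to-case variability.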
The consistency of the chatbots was evaluated by comparing the proportion of correct answers that were provided consistently across all five runs versus inconsistently provided correct answers. For this, we calculated the ratios and employed a one-sided chi-squared test to assess whether more consistent than inconsistent correct answers were provided.
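One way to implement this directional comparison, assuming it amounts to a chi-squared goodness-of-fit test against an even split of the correct answers with the p-value halved for the one-sided question (the counts below are illustrative, not the study's raw data):

```python
from scipy.stats import chisquare


def consistency_test(consistent_correct, inconsistent_correct):
    # Chi-squared goodness-of-fit against a 50/50 split of correct answers.
    # For df = 1 the statistic equals z**2, so halving the two-sided p-value
    # yields a one-sided test in the direction "more consistent than not".
    stat, p_two_sided = chisquare([consistent_correct, inconsistent_correct])
    if consistent_correct > inconsistent_correct:
        return stat, p_two_sided / 2
    return stat, 1 - p_two_sided / 2


# Illustrative counts: 71 consistently vs. 12 inconsistently correct answers
stat, p = consistency_test(71, 12)
```

A significant result indicates that correct answers were predominantly reproduced across all five runs rather than appearing sporadically.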

Assessment of time-effectiveness of radiologists and chatbots.
To investigate time-effectiveness, we measured the time to decision per human reader and chatbot and compared these durations via Student's t-test.
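A minimal sketch of this comparison with SciPy; the per-case timings below are illustrative, not the study's raw measurements.

```python
from scipy.stats import ttest_ind

# Illustrative per-case decision times in seconds
human_times = [48.0, 55.0, 50.2, 47.5, 52.3, 49.0, 51.8, 46.9]
chatbot_times = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1]

# Two-sample Student's t-test on time to decision
stat, p = ttest_ind(human_times, chatbot_times)
```

A positive statistic with a small p-value would indicate significantly longer decision times for the human readers.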

Results
Human and chatbot performance. Both radiologists (with access to the AO classification documents) classified the reports according to the AO classification with high accuracy: both provided the correct full AO code in 95% of the cases, and even higher proportions of correct classifications were noted for location, fracture type and group.
In contrast, the chatbots generally performed inferiorly, as given in Tables 1 and 2. Here, the generic chatbots GPT 3.5-Turbo and GPT 4 provided correct full AO codes in only 5% and 7% of the cases, respectively. Better performance was noted for the FraCChat based on GPT 3.5-Turbo (57% correct full AO codes), while the FraCChat relying on GPT 4 reached 83% correct full AO codes. This was corroborated by the performance for localisation, type and subtype, where we noted a similar gradient, with mostly significantly better performance of FraCChat 4 over FraCChat 3.5 over GPT 4 over GPT 3.5-Turbo (please see Table 2 for more details). Of note, the performance of FraCChat 4 was not inferior to human rating regarding location, part of bone, fracture type and group, as presented in Table 2.
Analysis of consistency revealed a higher proportion of consistently correct answers for the context-aware chatbots across all investigated levels of the AO code. Further, we observed that the context-aware chatbots provided significantly more consistently correct answers for the levels type and subgroup, while the generic chatbots did not. Details are provided in Table 3.

Time effectiveness of radiologists and chatbots.
The duration to provide answers on the 100 case files was substantially lower for the generic chatbots than for the human readers (generic chatbot mean 3.3 min vs. human mean 83.2 min; p < 0.001). In addition, the generic chatbots were significantly faster than the context-aware chatbots (generic chatbot mean 3.3 min vs. context-aware chatbot mean 7.2 min; p < 0.001), and the chatbots based on GPT 3.5 were each faster than the respective GPT 4-based variant (both p < 0.001).

Discussion
In this study, we assessed the feasibility of generic and context-aware chatbots providing AO fracture classification codes from radiological reports. Though the time needed to provide the AO classification was significantly lower for chatbots than for humans, we observed an inferior performance compared to human radiologists. However, we found that providing specific knowledge substantially enhances a chatbot's performance and consistency, rendering it non-inferior to human rating in terms of location, part of bone, fracture type and group. Various AI approaches have been proposed for image-based fracture detection, extending to whole-body approaches as commercially available tools10. AI-based methods are also available for image-based fracture classification, but are mostly limited to a particular region of the body. For example, for the ankle11, the distal radius12, the hip13 and the knee14, convincing results were obtained by deep learning, with the respective AO classifications derived directly from the image dataset.
Our approach follows a different track: human radiologists are easily capable of adequately describing a fracture in terms of its morphology and complexity. The translation of this description into the respective AO classification code was performed by an AI approach in this proof-of-concept work. Although we reached the performance of a human classifier only for location, part of bone, fracture type and group rather than on all levels of the AO system, we were able to show that chatbots based on LLMs are very well suited for this purpose and might improve workflows, as they are more time-efficient than human radiologists. In addition, the provision of specialized knowledge showed a clear advantage over generic chatbots; not only the performance but also the consistency was substantially improved.
Previous studies employed natural language processing (NLP) to extract data from radiology reports. Good performance was found for NLP algorithms that were specifically trained to extract fracture presence15 or fracture site16. Deep learning-based NLP methods using convolutional neural networks have shown improved performance compared to classical NLP approaches. However, these approaches require rather large datasets, as the performance scales with the amount of training data, as shown for the classification of hip fractures or smoking status17. In contrast to such specifically trained models, the method of providing specific knowledge via a vector index to a pretrained generic LLM comes with several advantages. The LlamaIndex interface offers a flexible data connector between an existing data source and the LLM, structuring and optimizing the data. Through this, the specific knowledge base can be exchanged swiftly and independently of the original LLM, removing the need for a task-specifically trained or fine-tuned model. Furthermore, the generated index is editable, and individual references can be added or deleted (for example, if new guidelines are published or amended). In turn, an existing index can be provided to a new LLM as content, avoiding the need for resource-consuming re-training of the AI algorithm itself.
The significantly higher performance of chatbots based on GPT version 4 compared with 3.5-Turbo is corroborated by recent studies on the simplification of radiological reports7 or on passing United States Medical Licensing Examination questions4. Further improvement in performance is expected with newer GPT models and better vector indices, as we noted a significant increase in performance and consistency for the GPT 4-based generic and context-aware chatbots over the GPT 3.5-Turbo-based variants. This might also address the limitation we encountered with all chatbots in the case of the complex hindfoot fracture (AO 89A). Here, the chatbots mostly correctly classified the individual fractures of the talus and the calcaneus, but did not manage the translation to AO 89A. An interesting pattern was observed in the case file for AO 2U3A3: despite the description of "multiple bone fragments", the chatbots incorrectly classified 2U2, whereas the human raters assigned the correct AO code.
Potential applications of the presented approach go beyond supporting AO classification in clinical routine, extending to retrospective analyses of existing reports and comparison with subsequent patient procedures for quality control. In addition, the extension to other classification systems is conceivable, as well as providing information on evidence-based therapies together with the respective evidence in clinical routine. Nevertheless, this approach requires validation with real-world case files to test its generalizability.
In conclusion, we found that a context-aware chatbot can derive structured information on fracture classification from radiological reports. Though it did not reach human performance, its time consumption was significantly lower.
We nevertheless provide evidence that refining and providing specific context to ChatGPT will be the next essential step in harnessing its power.

Table 3.
Number of correctly provided AO codes and proportion of consistently correct answers. Outputs were tested for whether the proportion of consistently provided correct answers was higher than that of inconsistently provided correct answers by a one-sided chi-squared test. Results are given with * for p < 0.05 and *** for p < .001.


Figure 1.
Schematic of the workflow of the case file creation, indexing of the context-aware chatbot and performance analysis. AO/OTA: Arbeitsgemeinschaft Osteosynthesefragen/Orthopaedic Trauma Association.

Table 1.
Performance of human experts and chatbots in AO classification tasks and respective time consumption.

Table 2.
Detailed comparison of the performance of humans and chatbots across the different levels of the AO classification system.