Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

The study aimed to evaluate the performance of two Large Language Models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, with two temperature parameter values, on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE - Spring 2022, Autumn 2022, and Spring 2023 - in two language versions, English and Polish. The accuracies of both models were compared and the relationships between the correctness of the answers and the answers' metrics were investigated. The study demonstrated that GPT-4 outperformed GPT-3.5 in all three examinations regardless of the language used. GPT-4 achieved a mean accuracy of 79.7% for both the Polish and English versions, passing all MFE versions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English, passing none and 2 of the 3 Polish versions for the temperature parameter equal to 0 and 1 respectively, while passing all English versions regardless of the temperature parameter value. GPT-4's scores were mostly lower than the average score of a medical student. There was a statistically significant correlation between the correctness of the answers and the index of difficulty for both models. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvements in LLMs before they can be reliably deployed in medical settings. Nonetheless, these findings suggest an increasing potential for the use of LLMs in medical education.


Objective
In this paper, we hence aimed to investigate the utility of GPT-3.5 and GPT-4 in the context of the Polish Medical Final Examination in two language versions: Polish and English. By evaluating the LLMs' performance on the examination and comparing it to real medical graduates' results, we seek to better understand their potential as a tool for medical education and clinical decision support, as well as the improvement in GPT technology that comes with the newest version of the model. We also aimed to evaluate the influence of the temperature parameter on the models' responses to questions from the medical field.

Materials and methods
The Polish Medical Final Examination (Lekarski Egzamin Końcowy, or LEK, in Polish) must be passed to complete medical education under Polish law and to apply for a license to practice medicine in Poland (and, based on Directive 2005/36/EC of the European Parliament, also in the European Union). The examination questions are prepared by the director of the Medical Examinations Center in cooperation with representatives of medical universities in Poland. Each participant can choose the language of the examination (Polish or English); the English version is translated from the original Polish one. The exam is a test comprising 200 questions, each with 5 options to choose from and only a single correct answer. In order to pass, it is required to obtain at least 56% correct answers 22. The examination contains questions regarding: (1) internal diseases, including cardiovascular diseases - 39 questions, (2) pediatrics, including neonatology - 29 questions, (3) surgery, including trauma surgery - 27 questions, (4) obstetrics and gynecology - 26 questions, (5) psychiatry - 14 questions, (6) family medicine - 20 questions, (7) emergency and intensive care medicine - 20 questions, (8) bioethics and medical law - 10 questions, (9) medical jurisprudence - 7 questions, (10) public health - 8 questions.
As both models were trained on data up to September 2021, it was decided to evaluate their performance on 3 editions of the Polish Medical Final Examination - Spring 2022 (S22), Autumn 2022 (A22), and Spring 2023 (S23) - in two versions: Polish and English. All questions from the previous editions of the examination are available online, along with the average results of medical graduates, detailing overall results, results of graduates who took the exam for the first time, those who graduated in the last 2 years, and those who graduated more than 2 years ago 23. Besides the content of the questions, the correct answers and answer statistics such as the index of difficulty (ID) and the discrimination power index (DPI) were published. Those indexes were calculated according to the equations presented below 24:

ID = (Ns + Ni) / (2n),    DPI = (Ns - Ni) / n,

where n is the number of examinees in each of the extreme groups (the 27% of participants with the best results and the 27% with the worst results in the entire test), Ns is the number of correct answers to the analyzed task in the group with the best results, and Ni is the number of correct answers to the analyzed task in the group with the worst results. The index of difficulty takes values from 0 to 1, where 0 means that the task is extremely difficult and 1 means that the task is extremely easy. The discrimination power index takes values from -1 (for extremely badly discriminating tasks) to 1 (for extremely well discriminating tasks).
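Under the definitions above, the two indexes are straightforward to compute. The following Python sketch shows them as plain functions; the function names and the example numbers are illustrative and not taken from the examination data:

```python
def index_of_difficulty(n_s: int, n_i: int, n: int) -> float:
    """Index of difficulty: fraction of correct answers in the two
    extreme groups combined (0 = extremely difficult, 1 = extremely easy)."""
    return (n_s + n_i) / (2 * n)

def discrimination_power_index(n_s: int, n_i: int, n: int) -> float:
    """Discrimination power index: -1 (extremely badly discriminating)
    to 1 (extremely well discriminating)."""
    return (n_s - n_i) / n

# Hypothetical task: 54 examinees per extreme group, 50 correct answers
# in the best-performing group, 20 in the worst-performing group.
print(index_of_difficulty(50, 20, 54))          # ~0.648 (fairly easy)
print(discrimination_power_index(50, 20, 54))   # ~0.556 (well discriminating)
```

Note that a task everyone answers correctly (Ns = Ni = n) yields ID = 1 and DPI = 0: easy, but useless for ranking examinees.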
For both models, the application programming interface (API) provided by OpenAI was used in order to accelerate the process of obtaining answers, with the gpt-3.5-turbo and gpt-4-0613 models 25. The API allows prompts to be sent to GPT models from programming languages and the process of obtaining responses to be automated. The analysis was performed with the temperature parameter set to 0 and 1, with the top_p parameter always set to 1 (the default), as altering both parameters is not recommended 26. The temperature parameter influences the randomness of the text generated by the model: lower values yield more focused and deterministic responses, while higher values make the model's responses more random and creative. Prompts sent through the API were the exact questions from the examination, without additional comments or context. From each response, the final answer was extracted and saved to an Excel file. If the answer was ambiguous, the given question was treated as not answered (in other words, as incorrectly answered). In the case of GPT-4, questions were provided either directly as prompts to the model or via the API using the gpt-4-0613 model. Final answers from all prompts are stored in Appendices 1 and 2 for GPT-3.5 and GPT-4 respectively.
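The answer-extraction step described above could be sketched as follows. The paper does not specify the exact extraction procedure, so this is a simplified, hypothetical helper: it looks for a single stand-alone option letter and treats anything else as ambiguous, i.e. not answered:

```python
import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    """Extract a single answer letter (A-E) from a model response.

    Returns None when the response is ambiguous (no letter found, or
    more than one distinct letter), which is scored as incorrect.
    """
    # Match stand-alone option letters such as "C", "C." or "C)".
    letters = set(re.findall(r"\b([A-E])\b", response))
    return letters.pop() if len(letters) == 1 else None

print(extract_final_answer("The correct answer is C."))         # C
print(extract_final_answer("Either B or D could be correct."))  # None
```

A production version would need to handle free-text responses that restate option contents rather than letters; those are the cases the study resolved manually as ambiguous.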
The accuracy of both models on each test was calculated by dividing the number of correct answers by the number of all questions that had a correct answer provided. As some questions were invalidated due to inconsistency with the latest knowledge and thus had no correct answer, the number of correct answers was divided by a number smaller than 200. Questions that contained an image were also excluded. Moreover, the Pearson correlation coefficients between the correctness of the answers and the index of difficulty and the discrimination power index were calculated, the Mann-Whitney U test was conducted to investigate whether there is a difference in those indexes between correct and incorrect answers, and Cohen's d was used to establish the effect size (0.2 - small, 0.5 - medium, 0.8 - large) 27. The overall scores obtained by the LLMs for each examination were also compared to the average score obtained by the medical graduates who took the exam in the given edition. The consistency of the responses depending on the language of the test was validated by calculating the number of identical answers for each examination. The responses of the models with temperature parameters equal to 0 and 1 were also compared using the Mann-Whitney U test for all examination editions and both languages. All questions were asked between the 29th of March and the 14th of August 2023 (ChatGPT March 23 version). The significance level was set at 0.05. Python 3.9.13 was used for the API calls, calculations, statistical inference, and visualizations. In the process of composing this paper, the authors leveraged Grammarly (Grammarly, Inc.) and GPT-4 to enhance the manuscript's linguistic quality and rectify grammatical inaccuracies. After employing these tools, the authors reviewed and edited the content as required, thereby accepting complete accountability for the publication's content.
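The statistical pipeline above can be sketched in a few lines with NumPy and SciPy. The arrays below are toy stand-ins for the per-question correctness indicators and the published index values (not the study's data), and cohens_d is a hypothetical helper using a pooled standard deviation:

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d effect size with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Toy data: correct[i] is 1 if question i was answered correctly,
# difficulty[i] is the published index of difficulty for question i.
rng = np.random.default_rng(42)
difficulty = rng.uniform(0.0, 1.0, 200)
correct = (rng.uniform(0.0, 1.0, 200) < difficulty).astype(int)

accuracy = correct.mean()  # share of correct answers
r, p_pearson = stats.pearsonr(correct, difficulty)
u, p_mw = stats.mannwhitneyu(difficulty[correct == 1],
                             difficulty[correct == 0])
d = cohens_d(difficulty[correct == 1], difficulty[correct == 0])
```

By construction the toy correctness depends on difficulty, so the correlation and the group difference come out positive, mirroring the direction of the effect reported in the paper.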

Results
GPT-3.5 managed to pass 2 out of the 3 versions of the examination in Polish with the temperature parameter equal to 1, and failed all versions when this parameter was equal to 0, while passing all versions in English regardless of the temperature parameter. GPT-4 passed all three versions of the examination regardless of the language and temperature parameter used. The detailed results obtained by both models are presented in Tables 1 and 2 and visualized in Figs. 1 and 2 for the temperature parameter equal to 0 and 1 respectively.
There was a statistically significant positive correlation between the correctness of the answers and the index of difficulty, as well as a statistically significant difference in the index value between correct and incorrect answers, in the case of all three exams for both models, both temperature parameters, and both languages, except for the Polish version of S23 with GPT-3.5 and temperature set to 0. Cohen's d for the difference in the index values between correct and incorrect answers indicated a large effect size (> 0.8) in the case of GPT-4, except for the English version of A22 with temperature equal to 1, where it was moderate. In the case of GPT-3.5, the effect size varied from small to moderate. GPT-4 always obtained a higher value of Cohen's d than GPT-3.5. There was also a statistically significant negative correlation and difference between the correctness of the answers and the discrimination power index in the case of A22 in Polish for both models with both temperature values, but only for GPT-3.5 in the case of the English version, and in the case of S23 (only the Polish version) for both models. The effect size was small in most cases, ranging from 0.026 to 0.690. The results are presented in Tables 3 and 4 for the index of difficulty and Tables 5 and 6 for the discrimination power index, for temperature parameters equal to 0 and 1 respectively. The boxplots of the index values depending on the correctness of the answers are visualized in Figs. 3 and 4 for the index of difficulty, and Figs. 5 and 6 for the discrimination power index, for temperature parameters equal to 0 and 1 respectively.

Table 1. Number of correct answers of GPT-3.5 and GPT-4 for each of the undertaken examinations for the temperature parameter equal to 0. In brackets, the number of questions with a given answer and the percentage accuracy are provided next to the exam version and the number of correct answers respectively.

GPT-4 had a higher number of questions with the same given answer regardless of the language of the examination compared to GPT-3.5 for all three versions of the test. The agreement between the answers of the GPT models on the same questions in different languages is presented in Tables 7 and 8 for temperature parameters equal to 0 and 1 respectively. There was no statistically significant difference between the results obtained for the same tests and models but with different temperature parameters. In Table 9, the comparison of the results for different temperature parameter values is presented.

Discussion
GPT-4 consistently outperformed GPT-3.5 in terms of the number of correct answers and accuracy across all three Polish Medical Final Examinations. This indicates a vast improvement in the scope of medical knowledge represented by the GPT-4 model compared to the previous version. For both versions of the model, there is a statistically significant correlation between the accuracy of the given answers and the index of difficulty. Assuming that this index represents the difficulty of the medical issue raised in a given question, as it is calculated from the number of correct responses of the best- and worst-performing participants, this might indicate a lack of in-depth medical knowledge.

Table 3. Results of the correlation analysis with the Pearson correlation coefficient and the obtained p-value given in brackets, along with the p-value obtained from the Mann-Whitney U test comparing the values of the index of difficulty for correct and incorrect answers for the temperature parameter equal to 0.

Students who graduated less than 2 years before the examination consistently outperformed both GPT models in both languages. The consistency of the answers between the different language versions of the test was much higher for GPT-4 than for GPT-3.5. On average, the most recent model returned identical answers across test languages in 84.3%/83.6% of instances (temperature equal to 0 and 1 respectively), compared to GPT-3.5's 65.8%/58.1% consistency. This highlights the improved ability of the GPT-4 model to interpret text and to encode the knowledge contained in the dataset on which the model was trained. On average, GPT-3.5 exhibited 9.4% and 1.6% higher accuracy on English questions than on Polish ones for temperature parameters equal to 0 and 1 respectively. In contrast, GPT-4 showed 1.0% higher and 0.2% lower accuracy in Polish than in English for temperature parameters equal to 0 and 1 respectively, which contrasts with the evaluation on the Massive Multitask Language Understanding (MMLU) benchmark, where accuracy in Polish was 3.4% lower than in English 6. The lack of a statistically significant difference between the model results obtained for the temperature parameter equal to 0 and 1 suggests that, in this range, the value of this parameter affects the overall creativity of the response rather than the representation of medical knowledge encoded in the model. Moreover, there was a notable difference in the responding style between the models. Previous work also highlighted the relationship between the correctness of the answers given by an LLM and the difficulty of the questions, which was likewise reported in our results. A study by Mihalache et al. showed that GPT-3.5 performed best on general medicine questions while obtaining the worst results on specialized questions 28. Bhayana et al.
demonstrated that GPT-3.5 exhibited superior performance on questions that required low-level thinking compared to those requiring high-level thinking 29. Moreover, the model struggled with questions involving the description of imaging findings, calculation and classification, and applying concepts. Recently, Google and DeepMind presented their LLM PaLM 2 and its medical domain-specific fine-tuned variant Med-PaLM 2 30,31. The performance of GPT-4 and Med-PaLM 2 on USMLE, PubMedQA, MedMCQA, and MMLU appears to be very similar, with each model superior on an equal number of the evaluated tests. In this comparison, it is worth noting that GPT-4 is a general-purpose model and was not explicitly fine-tuned for the medical domain.
There may be several potential reasons for the imperfect performance and the incorrect answers provided by the tested models. First of all, both models are general-purpose LLMs that are capable of answering questions from various fields and are not dedicated to medical applications. This problem can be addressed by fine-tuning the models, that is, further training them on medical material. As was shown in other studies, fine-tuning can substantially improve a model's performance in a specific domain.

Table 7. The number of questions on which the models provided the same answer regardless of the test language for the temperature parameter equal to 0. In brackets, the number of correct answers with the same response is presented.

Figure 1. Comparison of the performance of both models, along with the passing score and the average medical graduate score, for all three examinations for the temperature parameter equal to 0.

Figure 2. Comparison of the performance of both models, along with the passing score and the average medical graduate score, for all three examinations for the temperature parameter equal to 1.

Figure 3. Boxplots of the index of difficulty for correct and incorrect answers for all three versions of the examination and both languages for the temperature parameter equal to 0.

Figure 4. Boxplots of the index of difficulty for correct and incorrect answers for all three versions of the examination and both languages for the temperature parameter equal to 1.

Figure 5. Boxplots of the discrimination power index for correct and incorrect answers for all three versions of the examination and both languages for the temperature parameter equal to 0.

Figure 6. Boxplots of the discrimination power index for correct and incorrect answers for all three versions of the examination and both languages for the temperature parameter equal to 1.

Table 2. Number of correct answers of GPT-3.5 and GPT-4 for each of the undertaken examinations for the temperature parameter equal to 1. In brackets, the number of questions with a given answer and the percentage accuracy are provided next to the exam version and the number of correct answers respectively.

Table 4. Results of the correlation analysis with the Pearson correlation coefficient and the obtained p-value given in brackets, along with the p-value obtained from the Mann-Whitney U test comparing the values of the index of difficulty for correct and incorrect answers for the temperature parameter equal to 1.
A statistically significant negative correlation and a significant difference assessed with the Mann-Whitney U test were found between the correctness of the answers and the discrimination power index for almost all models, languages, and temperature parameters in A22 (except for the difference for the GPT-4 model and the English version) and for all settings in S23 (only the Polish version), which might be a sign of the simplicity of the models' reasoning or of their ability to simplify the medical questions. In all versions of the test, GPT-4 scored slightly below the medical student averages, which were equal to 84.8%, 84.5%, and 83.0% for S22, A22, and S23 respectively, except for S23 with the temperature parameter equal to 0, where GPT-4 obtained 83.5%. The latest GPT version outperformed students who graduated over 2 years ago on S23 (mean score 156.65) in both languages for temperature equal to 0 and only in Polish for temperature equal to 1, as well as those taking A22 as their first exam (mean score 159.57) in the case of the Polish version and temperature equal to 0.

Table 5. Results of the correlation analysis with the Pearson correlation coefficient and the obtained p-value given in brackets, along with the p-value obtained from the Mann-Whitney U test comparing the values of the discrimination power index for correct and incorrect answers for the temperature parameter equal to 0.

Table 6. Results of the correlation analysis with the Pearson correlation coefficient and the obtained p-value given in brackets, along with the p-value obtained from the Mann-Whitney U test comparing the values of the discrimination power index for correct and incorrect answers for the temperature parameter equal to 1.