Introduction

Generative artificial intelligence (AI), driven by large language models (LLMs), has had a profound impact on all scientific disciplines. In biomedicine, these impacts span clinical practice, research, and education1. In education, LLMs have been shown to score well above passing levels on medical board exams2,3,4, although until recently no study had compared their scores directly with those of trainee test-takers on actual exams5. LLMs have also been found to perform comparably to students and others on objective structured clinical examinations6, in answering general-domain clinical questions7,8, and in solving clinical cases9,10,11,12,13. They have also been shown to engage in conversational diagnostic dialogue14 and to exhibit clinical reasoning comparable to that of physicians15. LLMs have had a comparably strong impact on education in fields beyond biomedicine, such as business16, computer science17,18,19, law20, and data science21.

The successes of LLMs raise concerns about the future of student learning and assessment, particularly in higher education. LLMs may be good at providing answers, but they do not necessarily steer users (or students) to the original sources of knowledge, nor do they assess the trustworthiness of those sources22,23. Another issue is the general tendency of LLMs to hallucinate or otherwise confabulate with stated confidence, potentially misleading students24. Others note that LLMs might give students easy answers to assessments, undermining their learning and development of competence25,26,27,28. A further concern is that users find LLM output competent, trustworthy, clear, and engaging, even when such confidence may not be warranted29.

This study aimed to compare how LLMs perform on the assessments in one of the most widely taken online introductory courses in the field of biomedical and health informatics, a course taught at Oregon Health & Science University (OHSU) by one of the authors (WRH) for nearly three decades. The other author (KFH) has been a teaching assistant (TA) in the course for over a decade. The course is offered to three different audiences using identical curricular materials and assessments:

  • Graduate students (BMI 510/610)—this course has been offered as part of what is now the health and clinical informatics (HCIN) major in the OHSU Biomedical Informatics Graduate Program. In addition to students in the HCIN major, students in other graduate programs (e.g., public health, nursing, and biomedical basic science) can take this course as an elective in their programs.

  • Continuing education (AMIA 10×10)—offered since 2005, this version of the course is known as 10×10 (“ten by ten”)30,31.

  • Medical students (MINF 705B/709A)—beginning early in the COVID-19 pandemic, when medical education had to pivot rapidly to virtual learning, this course was offered as an elective for medical students and has continued due to student interest.

All offerings of the course are online. The major curricular activity is voice-over-PowerPoint lectures, with about three hours of lecture for each of the 10 units of the course. Additional readings from a textbook are optional. Students participate in threaded discussions in OHSU’s instance of the open-source Sakai learning management system (LMS) and are assessed with multiple-choice questions (MCQs), a final exam, and (for some—see below) a course paper.

OHSU operates on an academic quarter system and offers BMI 510/610 as a 10-unit course, with units released weekly. Because AMIA 10×10 is a continuing-education course, its 10 units are spread over 16 weeks. The medical student version of the course is offered either as a two-week block (705B) or over an academic quarter (709A). From 1996 through the winter quarter of 2024, 1683 students had completed BMI 510/610. From its inception in 2005 through the latest offering ending in early 2024, 3260 individuals had completed the 10×10 course.

Student learning in this course is assessed by up to three activities, depending on the audience:

  • MCQs: Each of the 10 units has a required assessment of 10 questions that all students must complete.

  • Final examination: The exam is required in BMI 510/610; optional in AMIA 10×10 for those wanting to obtain academic credit, usually to pursue further study in the field; and not required in MINF 705B/709A. Students are instructed to provide short answers of one sentence or less on the 33-question exam. The exam has historically been open-book so that students can focus on applying the material rather than memorizing it. As such, test-takers can consult materials on the LMS and the Internet, although they are forbidden from contacting other people.

  • Course project: A term paper of 10–15 pages is required for BMI 510/610, a three-page paper is required for AMIA 10×10, and no paper is required for MINF 705B/709A.

Overall student grading for each course is as follows:

  • BMI 510/610 is graded on a letter-grade scale. The final grade is a weighted combination of the MCQs (30%), final exam (30%), student paper (30%), and class participation (10%); a brief worked sketch of this weighting follows this list.

  • AMIA 10×10 is a continuing-education course graded on a pass-fail basis. Students completing the course can optionally take the BMI 510/610 final exam to obtain academic credit, and a letter grade is assigned based on the final exam grade.

  • MINF 705B/709A is graded (as with all OHSU medical school courses) on a pass-fail basis. Students are required to obtain an average of 70% across all of the MCQs and are not required to take the final exam or write a course paper.
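
To make the BMI 510/610 weighting above concrete, the following is a minimal sketch of the final-grade calculation; the function name and component scores are hypothetical placeholders for illustration and are not taken from the course materials.

```python
# Minimal sketch of the BMI 510/610 weighting described above:
# MCQs 30%, final exam 30%, paper 30%, class participation 10%.
# The function name and example values are hypothetical placeholders.

def weighted_final_grade(mcq_avg: float, final_exam: float,
                         paper: float, participation: float) -> float:
    """Return the weighted final percentage grade (all inputs on a 0-100 scale)."""
    return 0.30 * mcq_avg + 0.30 * final_exam + 0.30 * paper + 0.10 * participation

# Hypothetical example values, for illustration only
print(weighted_final_grade(mcq_avg=88, final_exam=84, paper=90, participation=100))  # 88.6
```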

The content of the course is updated annually and aims to reflect the latest research findings, operational best practices, government programs and regulations, and future directions for the field. The goal of the course is to provide a detailed overview of biomedical and health informatics to those who will work at the interface of healthcare and information technology (IT). The course also aims to provide an entry point for those wishing further study (and/or career development) in the field. It provides a broad understanding of the field from the vantage point of those who implement, lead, and develop IT solutions for improving health, healthcare, public health, and biomedical research. The annual updating is undertaken at the beginning of each calendar year, with the course materials rolled out in courses starting in the spring. An outline of the course content is listed in Table 1, with more detail provided in Supplementary Note 1.

Table 1 Biomedical and health informatics introductory course outline units with titles

In this study, we compared the knowledge-assessment results of students with those obtained by prompting several commercial LLMs and one open-source LLM as they would likely be used by higher-education students, i.e., through their interactive Web interfaces. The goal of the study was to assess how well these LLMs, used realistically as students would use them, performed compared with students in a highly subscribed introductory course in biomedical and health informatics.

Table 2 Minimum, 25th percentile, median, 75th percentile, maximum, and average scores for MCQs on each unit assessment and the final exam for each student group and all groups

Results

The 2023 version of the course was offered between Spring 2023 and Winter 2024. With the 2023 content, the course was completed by a total of 139 students: 30 graduate students (BMI 510/610), 85 continuing-education students (AMIA 10×10), and 24 medical students (MINF 705B/709A). The MCQs were completed by all students in all three offerings, while the final exam was completed by all 30 BMI 510/610 students and by 21 of the 85 AMIA 10×10 students who opted to take it. The minimum, 25th percentile, median, 75th percentile, maximum, and average scores for the MCQs on each unit assessment and the final exam are shown for each student group and all groups combined in Table 2.

The output from the LLM prompts for the MCQs and the final exam was graded by KFH and is shown in Table 3. ChatGPT Plus and CoPilot-Bing Precise tied for the highest average score on the MCQs, followed closely by Gemini Pro, Llama 3.1 405B, Mistral-Large, and Claude 3 Opus. On the final exam, Gemini Pro and Claude 3 Opus scored highest, followed by Llama 3.1 405B, CoPilot-Bing Precise, Mistral-Large, and ChatGPT Plus. Giving equal weight to the MCQ average and the final exam, Gemini Pro scored best overall. Figure 1 summarizes some key results, namely the MCQ averages and final exam results for students at the 25th, 50th (median), and 75th percentiles of performance, along with Gemini Pro. Gemini Pro scored above the 75th percentile on 3 unit assessments, equal to the 75th percentile on 1, and below the 75th percentile on 6. It scored above the 50th percentile on 4 unit assessments, equal to the 50th percentile on 4, and below the 50th percentile on 2. Gemini Pro scored above the 75th percentile on the final exam.
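
As a minimal sketch of the equal-weighting calculation above, each LLM's overall score is simply the unweighted mean of its MCQ average and its final exam score; the names and scores below are hypothetical placeholders rather than the values in Table 3.

```python
# Sketch of the equal-weighting scheme used to rank the LLMs overall:
# overall score = mean of MCQ average and final exam score.
# The names and scores below are hypothetical placeholders, not Table 3 values.

llm_scores = {
    # name: (MCQ average %, final exam %)
    "LLM A": (93.0, 88.0),
    "LLM B": (90.0, 92.0),
}

overall = {name: (mcq + exam) / 2 for name, (mcq, exam) in llm_scores.items()}

# Rank from best to worst overall score
for name, score in sorted(overall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")
```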

Table 3 Scores on each unit assessment and the final exam for all LLMs assessed
Fig. 1

Unit assessment and final exam results for students and Gemini Pro. Summary of unit assessment and final exam results for all students at the 25th, 50th (median), and 75th percentiles of performance (thinner green, orange, and blue lines, respectively) and the best-performing LLM, Gemini Pro (thicker black line).

The stopwatch times taken for each prompt for each LLM are shown in Table 4. Although there were substantial time differences among the LLMs, the time taken by all of the LLMs was minimal compared to the time taken by students. Although we have no data on the time taken to complete the MCQs, students are given up to 4 hours to complete the final exam, and the average time taken was 162 minutes (range 34–240). We observed that the time taken by each LLM was related mostly to the amount of text it printed to the screen: some LLMs gave just the answers, while others provided longer explanations and took more time to display the text in the browser window.

Table 4 Clock time taken in seconds for each LLM for each assessment

The distribution of correct and incorrect answers for the LLMs on the final exam is shown in Fig. 2. Every LLM gave a wrong answer on questions 19 and 23, the latter of which required calculating a Boolean expression. Twenty-three of the 33 questions were answered correctly by all of the LLMs.

Fig. 2

Correct and incorrect answers on final exam for all LLMs. Topics with correct (green) and incorrect (red) answers on final exam for all LLMs.

Discussion

This is, to our knowledge, the first assessment of LLMs based on a course in the biomedical domain where performance was compared with actual student results. In addition, the student assessment data comes from relatively large numbers of learners in three different types of educational programs—graduate, continuing education, and medical student.

All of the LLMs performed well on the course materials, with Gemini Pro performing best and Llama 3.1 405B, Claude 3 Opus, and CoPilot-Bing Precise close behind. Gemini Pro scored at about the 75th percentile of all students who had taken the class between early 2023 and early 2024. Although the graduate and continuing-education offerings of the course have additional requirements for their final grade, the performance of all of the LLMs was well above the passing levels for the MCQ and final exam components of the course. The clock time for the LLMs varied—mainly due to the amount of text printed to the browser window—but was far less than the time typically taken by students, e.g., the up to four hours allowed for the latter to complete the final exam. An observation made when grading the LLM final exams was that the LLMs followed the instructions, giving answers of at most two sentences and rarely giving one-word answers. In contrast, students usually varied the length of their answers and often gave one-word answers. Another minor difference was that the LLMs always produced grammatically correct sentences with correct spelling, whereas some students did not.

The results of this study raise significant questions for the future of student assessment in most if not all academic disciplines. Clearly LLMs can generate output at a high level for graduate-level courses such as introductory biomedical and health informatics. What are the options for maintaining the ability to assess students? One challenge for a course like this is that its focus and assessments are knowledge-based. The course does not develop or assess skills, but instead provides the knowledge and vocabulary for further skills development. This course might also consider, at least for the final examination, abandoning its open-book format.

Another option for maintaining the ability to assess students might be to develop more complex “Google-proof” questions for the assessments32. Some suggest the use of generative AI detectors, although a review of recent research found mixed ability to detect text produced by LLMs33. One concern with such detectors is their propensity to misclassify non-native English writing as AI-generated34.

The success of LLMs in educational tasks has implications beyond the student phase of education. If students are able to excel in classes because of generative AI, this may affect the professional practice of graduates who have not necessarily mastered the foundational knowledge of the fields in which they work. Assessments may be particularly problematic for adult learners who take mostly online courses asynchronously and cannot come to campus for proctored exams. Indeed, Cooper and Rodman note that LLM use in medical education has “the potential to be at least as disruptive as the problem-oriented medical record, having passed both licensing and clinical reasoning exams and approximating the diagnostic thought patterns of physicians.”35 Mollick notes that educators face a “homework apocalypse,” in which simple prompting of LLMs can achieve passing or even better grades on assessments36.

There were a number of limitations to this study. First, we reviewed LLM performance in a single course, and the results may not generalize to other graduate, continuing-education, and/or medical student courses. Second, students taking the course after the November 2022 release of ChatGPT may themselves have used generative AI, which could have had a beneficial or detrimental effect on their performance. Third, since the biomedical and health informatics field evolves rapidly, in AI and beyond, how LLMs will affect performance in courses covering it over the long run is unknown. Finally, there are reproducibility challenges in using industry-provided LLMs, although this is true for just about all studies using such LLMs, which undergo constant change and updating. We do not, however, believe that these limitations undermine our main results and conclusion, which is that LLMs scored between the 50th and 75th percentiles of students in a highly subscribed introductory biomedical informatics course.

In conclusion, we found that the best-performing LLM exceeded the performance of about three-quarters of the graduate, continuing-education, and medical students taking an introductory online course in biomedical and health informatics. Our results suggest that LLMs are having a profound effect on education and its assessment. Certainly, LLMs will be part of the toolkit of professionals and academics in all disciplines. The challenge is understanding how LLM use from the beginning of learning may affect mastery of competence and professional behavior later on. Future research must address these concerns to determine the optimal role of generative AI in all levels of education.

Methods

We compared student performance on the MCQs and the final exam with that of six state-of-the-art LLMs: ChatGPT Plus (GPT-4), Claude 3 Opus, CoPilot with Bing-Precise, Gemini Pro, Llama 3.1 405B, and Mistral-Large. Use of de-identified aggregate student scores was determined by the OHSU Institutional Review Board (IRB) to be research not involving human subjects, with IRB review and approval not required (STUDY00026901). This enabled us to calculate the average grades for students in the different offerings of the course using the 2023 content. We used Microsoft Excel to calculate the median and related summary scores.
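
Although Excel was used for the actual analysis, the following sketch shows an equivalent, scriptable way to compute the summary statistics reported in Table 2 (minimum, 25th percentile, median, 75th percentile, maximum, and average); it assumes linear percentile interpolation, which matches Excel's PERCENTILE.INC, and the score list is a hypothetical placeholder.

```python
# Scriptable equivalent of the Excel summary statistics reported in Table 2.
# NumPy's default "linear" percentile interpolation matches Excel's PERCENTILE.INC.
# The scores below are hypothetical placeholders, not actual student data.
import numpy as np

scores = np.array([70, 80, 80, 80, 90, 90, 90, 100, 100, 100], dtype=float)

summary = {
    "min": scores.min(),
    "25th percentile": np.percentile(scores, 25),
    "median": np.median(scores),
    "75th percentile": np.percentile(scores, 75),
    "max": scores.max(),
    "average": scores.mean(),
}
for stat, value in summary.items():
    print(f"{stat}: {value:.1f}")
```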

To assess LLM performance, we used the latest versions of the LLMs in their interactive user modes, since this is likely how most students would access them. Each LLM was prompted using a standard approach:

  • MCQs: Each LLM was prompted first with, “You are a graduate student taking an introductory course in biomedical and health informatics. Please provide the best answers to the following multiple-choice questions.” This was followed by pasting in the MCQs one unit (10 questions) at a time exactly as they appeared in the MCQ preview file in the Sakai LMS.

  • Final exam: Each LLM was prompted with, “You are a graduate student taking the final exam in an introductory course in biomedical and health informatics. Answer each of the following questions with a short answer that is one sentence or less.” This was followed by pasting in the exam, which had 33 questions, separated into 8 sections with a one-sentence heading for each section, exactly as it appeared in the Sakai LMS exam module.

    The LLMs were prompted on the following dates and times using their standard interactive interfaces:

  • ChatGPT Plus (GPT-4) on February 20, 2024 at 3 pm Pacific Standard Time (PST)

  • Gemini Pro on February 28, 2024 at 4 pm PST

  • Mistral-Large on March 1, 2024 at 3 pm PST

  • CoPilot with Bing-Precise on March 1, 2024 at 4 pm PST

  • Claude 3 Opus on March 10, 2024 at 6 am Pacific Daylight Time (PDT)

  • Llama 3.1 405B on August 16, 2024 at 1 pm PDT

We captured the text output from each of the LLMs, which was then manually graded by KFH and reviewed by WRH. The prompts and answer keys are provided in Supplementary Notes 2–5. Some sample questions are shown in Fig. 3.
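
The prompts above were pasted by hand into each LLM's interactive web interface; no programmatic API was used. For illustration only, the sketch below shows how the same two prompt templates could be assembled in code; the query_llm function is a hypothetical stand-in for whatever chat interface or API a reader might choose.

```python
# Illustration only: the study pasted these prompts into each LLM's web
# interface by hand. This sketch assembles the same two prompt templates
# programmatically; query_llm() is a hypothetical stand-in, not a real API call.

MCQ_PREAMBLE = (
    "You are a graduate student taking an introductory course in biomedical "
    "and health informatics. Please provide the best answers to the following "
    "multiple-choice questions."
)

EXAM_PREAMBLE = (
    "You are a graduate student taking the final exam in an introductory "
    "course in biomedical and health informatics. Answer each of the following "
    "questions with a short answer that is one sentence or less."
)

def build_mcq_prompt(unit_questions: str) -> str:
    """One unit (10 questions) pasted exactly as exported from the Sakai LMS."""
    return f"{MCQ_PREAMBLE}\n\n{unit_questions}"

def build_exam_prompt(exam_text: str) -> str:
    """The full 33-question exam, pasted exactly as it appears in the LMS."""
    return f"{EXAM_PREAMBLE}\n\n{exam_text}"

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in: replace with the chat interface or API of your choice."""
    raise NotImplementedError
```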

Fig. 3

Multiple-choice and final exam questions. Example multiple-choice and final exam questions used in this study.

The analysis had two small amounts of missing data that were unlikely to affect the overall results. Data from two of the 10 units were not used for the last group of BMI 510/610 and AMIA 10×10 students (6 in BMI 510/610 and 43 in AMIA 10×10) taking the course in late 2023 to early 2024, because some of the course content, and therefore the MCQs for those units, had been updated. In addition, a configuration error in the Sakai LMS lost the individual (but not aggregate) quiz results for 6 students taking MINF 705B/709A in early 2024.