Introduction

November 30, 2022, will go down in history as the date when a free version of the AI language model created by OpenAI, ChatGPT-3.5 (OpenAI, 2022; hereafter ChatGPT), was made available for public use. This language model can generate text, answer questions, and complete tasks such as translation and summarization (Agomuoh, 2023).

ChatGPT can be employed as an assistant in academia. It can improve writing skills, since it is trained to deliver feedback on style, coherence, and grammar (Aljanabi et al., 2023), extract key points, and provide citations (Aydin and Karaarslan, 2022). This could increase researchers’ efficiency, allowing them to concentrate on more crucial activities (e.g., analysis and interpretation). This has been supported by studies showing that ChatGPT can generate abstracts (Gao et al., 2023; Ma et al., 2023), high-quality research papers (Kung et al., 2023), dissertations, and essays (Aljanabi et al., 2023). Previous studies showed that ChatGPT can create quality essays on different topics (Hoang, 2023; Hoang et al., 2023; Nguyen and La, 2023; Nguyen and Le, 2023a; Nguyen and Le, 2023b; Susnjak, 2023). For example, this program, in conjunction with DaVinci-003, generated high-quality short-form Physics essays that would be awarded First Class, the highest grade in the UK higher education system (Yeadon et al., 2023). These capabilities have also led to questions on the ethics of using ChatGPT in different forms of academic writing and on AI authorship (Bishop, 2023; Grimaldi and Ehrler, 2023; Kung et al., 2023; Pourhoseingholi et al., 2023; Xiao, 2023), and raised issues in evaluating academic tasks such as students’ essays (Stokel-Walker, 2022; Whitford, 2022). Unavoidable content plagiarism issues were discussed, and solutions for adapting essay settings and guidelines were proposed (Cotton et al., 2023; Hoang, 2023; Lo, 2023; Sallam, 2023; Stokel-Walker, 2022; Yeadon et al., 2023). A recent SWOT analysis of ChatGPT’s impact on education comprehensively analyzed all of these issues. Strengths included advanced natural language generation, self-improvement, and personalized feedback, with potential benefits in information accessibility, personalized learning, and reduced teaching workload. Weaknesses encompassed a limited understanding of the topic, an inability to critically evaluate information, challenges in evaluating response quality, bias risks, and a lack of higher-order thinking. Threats included contextual limitations, academic integrity risks, the perpetuation of discrimination, increased plagiarism, etc. (Farrokhnia et al., 2023).

Argumentative essays are among the most advanced student tasks in higher education and, as such, pose a challenge for students (Latifi et al., 2021), which makes essay writing a natural setting in which to test ChatGPT. Such essays strengthen students’ ability to construct an argument and build confidence in their knowledge, preparing them not only for the academic environment but also for real-life situations (Valero Haro et al., 2022; Heitmann et al., 2014). Previous studies showed that students experience problems in externalizing their argumentation knowledge both at the individual level (argumentative essay) and at the collaborative level (argumentative discourse), regardless of whether they work in groups or individually, and that they need to further develop their argumentation competence (Banihashem et al., 2023a; Banihashem et al., 2023b; Kerman et al., 2023; Ranjbaran et al., 2023). However, it is still unknown how ChatGPT performs as a writing assistant tool in a student environment and whether it enhances students’ performance. Thus, this research investigated whether ChatGPT would improve students’ essay grades, reduce writing time, and affect text authenticity.

Materials and methods

We invited second-year master’s students from the University Department of Forensic Sciences to voluntarily participate in research on essay writing as part of the course Forensic Sciences seminar. Out of 50 students enrolled in the course, 18 applied via a web form and participated in the study. Before the experiment, we divided them into two groups according to study module and weighted grade point average (GPA) to ensure a similar composition of the groups. The control group (n = 9, GPA = 3.92 ± 0.46) wrote the essay traditionally, while the experimental group (n = 9, GPA = 3.92 ± 0.57) used ChatGPT assistance, version 2.1.0 (OpenAI, 2022).

We explained the essay scoring methodology (Schreyer Institute for Teaching Excellence, 2023) to both groups, with written instructions about the essay title (The advantages and disadvantages of biometric identification in forensic sciences), essay length (800–1000 words in Croatian), formatting, and citation style (Vancouver). We introduced the experimental group to the ChatGPT tool, which included a brief explanation of the tool and an example of entering a prompt on an issue unrelated to the essay. They were instructed to use the tool freely, without any limitations (e.g., for creating a complete essay, for concept drafting, for specific topic-related questions, for corrections and suggestions, etc.). We did not require students to submit the prompts they used or the responses they received. All students had four hours to finish the task and could leave whenever they wanted. The control group was additionally supervised to ensure they did not use ChatGPT. The students’ names were coded to ensure individual and group anonymity and to prevent grading bias.

Two teachers graded the essays (ŽB, associate professor, and IJ, assistant professor). The teachers compared the grades, and if their scoring differed, the final grade was decided by consensus. We used the essay rubric from the Schreyer Institute for Teaching Excellence, Pennsylvania State University (http://www.schreyerinstitute.psu.edu/pdf/suanne_general_resource_WritingRubric.pdf), which included four criteria (mechanics, style, content, and format) and grades from A to D (Schreyer Institute for Teaching Excellence, 2023). We converted the categorical grades to numbers (A = 4, B = 3, C = 2, D = 1) for further analysis. For each student, we also recorded the writing time.

We checked the authenticity of each document using PlagScan (2022) and conducted pairwise comparisons of document similarity in RStudio (ver. 1.2.5033) with the textreuse package (Mullen, 2015), using the Jaccard similarity index. We also checked each text with an AI text classifier to test whether a human or an AI created it; according to this classifier, a text is scored as very unlikely, unlikely, unclear, possibly, or likely AI-generated (OpenAI, 2023). We opted for this classifier after similar programs (OpenAI, 2022; Goal and ChatGPT, 2023; Debut et al., 2023) failed to recognize ChatGPT-generated text in a non-English language as AI-assisted.
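For illustration, the pairwise similarity comparison can be reproduced with a short R script. This is a minimal sketch only: the directory name and the word 5-gram tokenization are illustrative assumptions, not details reported in the study.

library(textreuse)

# Load all essay files from a (hypothetical) local directory into a corpus,
# tokenized into word 5-grams
corpus <- TextReuseCorpus(dir = "essays/", tokenizer = tokenize_ngrams, n = 5)

# Compute the Jaccard similarity index for every pair of essays
similarities <- pairwise_compare(corpus, jaccard_similarity)

# Summarize the pairwise values (median and range), as reported in the Results
sim_values <- similarities[upper.tri(similarities)]
median(sim_values)
range(sim_values)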

Statistical analysis and visualization were conducted using Excel (Microsoft Office ver. 2301) and RStudio (ver. 1.2.5033). The final essay score was calculated as the average of the four grading elements (mechanics, style, content, and format). Linear regression was used to test the effects of group, writing duration, module, and GPA on the overall essay score. The level of statistical significance was set at P ≤ 0.05.
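The scoring and regression step can be sketched in R as follows; the data frame and its column names are illustrative assumptions (one row per student), not the study’s actual scripts.

# Convert categorical grades (A-D) to numbers and average the four elements
grade_to_num <- c(A = 4, B = 3, C = 2, D = 1)
essays$score <- rowMeans(cbind(grade_to_num[as.character(essays$mechanics)],
                               grade_to_num[as.character(essays$style)],
                               grade_to_num[as.character(essays$content)],
                               grade_to_num[as.character(essays$format)]))

# Linear regression of the overall essay score on group, writing duration,
# study module, and GPA; significance threshold P <= 0.05
model <- lm(score ~ group + duration + module + gpa, data = essays)
summary(model)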

Results

The duration of essay writing was 172.22 ± 31.59 min for the ChatGPT-assisted group and 179.11 ± 31.93 min for the control group. Both groups, on average, obtained grade C, with a slightly higher average score in the control group (2.39 ± 0.71) than in the ChatGPT group (2.00 ± 0.73) (Fig. 1A). The mean proportion of non-authentic text was 11.87% ± 13.45% in the ChatGPT-assisted group and 9.96% ± 9.81% in the control group. The text similarity in the overall sample was low (Supplementary Table 1), with a median Jaccard similarity index of 0.002 (range 0–0.054). In the control group, the AI text classifier labeled two texts as possibly and one as likely AI-generated, two as unlikely AI-generated, and four as unclear. In the ChatGPT group, three texts were labeled as possibly and five as likely AI-generated, while one was labeled as unclear.

Fig. 1: Essay scores by variable for the control and ChatGPT-assisted groups. A Average essay scores, B Duration and essay scores, C GPA and essay scores, D Text authenticity and essay scores.

Figure 1B, C suggests a positive association of both writing duration and GPA with essay scores. Students with higher GPAs in the control group achieved higher scores than those in the ChatGPT group. An association between essay scores and the proportion of non-authentic text (Fig. 1D) was detected only in the ChatGPT group, where students with more non-authentic text achieved lower essay scores.

The linear regression model showed a moderate positive relationship between the four predictors and the overall essay score (R = 0.573; P = 0.237). However, none of the predictors had a significant effect on the outcome: group (P = 0.184), writing duration (P = 0.669), module (P = 0.388), and GPA (P = 0.532).

Discussion

To the best of our knowledge, this is the first study to test ChatGPT-3.5 as an essay-writing assistance tool in a student population. Our study showed that the ChatGPT group did not outperform the control group on any of the indicators: the students did not deliver higher-quality content, did not write faster, and did not produce a higher proportion of authentic text.

The overall essay score was slightly better in the control group, which could result from the experimental group’s over-reliance on the tool or unfamiliarity with it. This is in line with Fyfe’s study on writing student essays using GPT-2, in which students reported that writing with the tool was harder than writing by themselves (Fyfe, 2022). The issue is also raised by Farrokhnia et al., who pointed out ChatGPT’s weakness of lacking a deep understanding of the topic, which, in conjunction with students’ lack of knowledge, could lead to dubious results (Farrokhnia et al., 2023). Students also reported that not knowing the sources of the generated text additionally distracted them from the writing task (Fyfe, 2022). It is noteworthy that both groups obtained an average grade of C, which can be explained by other studies arguing that students’ writing lacks solid argumentation, both in writing in general and in argumentative essays (Banihashem et al., 2023a; Banihashem et al., 2023b; Kerman et al., 2023; Farrokhnia et al., 2023; Ranjbaran et al., 2023). This demanding task may have been even more difficult when using ChatGPT, owing to the issues already mentioned, such as unfamiliarity with the tool and the additional time required to link ChatGPT-generated content and/or information with real literature sources.

Some studies did show more promising results (Hoang, 2023; Hoang et al., 2023; Nguyen and La, 2023; Nguyen and Le, 2023a; Nguyen and Le, 2023b; Susnjak, 2023; Yeadon et al., 2023), but unlike our study, they were mainly based on interaction between ChatGPT and experienced researchers. This could be one reason for the lower performance of our ChatGPT group, as experienced researchers are more skilled in formulating questions, guiding the program to obtain better-quality information, and critically evaluating the content.

Another interesting finding is that the use of ChatGPT did not accelerate essay writing; students in both groups required a similar amount of time to complete the task. As expected, longer writing time was related to a better essay score in both groups. This finding could also be explained by students’ feedback from Fyfe’s (2022) study, in which they specifically reported difficulties combining the generated text with their own style. Thus, although ChatGPT could accelerate writing in the first phase, more time is then required to assemble the content and finalize the task.

Our experimental group had slightly more problems with plagiarism than the control group. Fyfe (2022) also showed that his students felt uncomfortable writing and submitting the task since they felt they were cheating and plagiarizing. However, a pairwise comparison of essays in our study did not reveal remarkable similarities, indicating that students had different reasoning and styles, regardless of whether they were using ChatGPT. This could also imply that applying the tool for writing assistance produces different outcomes for the same task, depending on the user’s input (Yeadon et al., 2023).

The available ChatGPT text detector (Farrokhnia et al., 2023) did not perform well, giving false-positive results in the control group. Most classifiers are intended for English and usually carry disclaimers about performance in other languages. This highlights the need to improve existing algorithms for different languages or to develop language-specific ones.

The main concern about using ChatGPT in academic writing has been a lack of authenticity (Cotton et al., 2023; Susnjak, 2023; Yeadon et al., 2023), but we believe that such tools will not increase the non-originality of published content or students’ assignments. Detectors of AI-generated text are developing daily, and it is only a matter of time before highly reliable tools are available. While our findings suggest no immediate need for significant concern regarding the application of ChatGPT in students’ writing, it is crucial to acknowledge that this study’s design reflects the real-life situation of using ChatGPT as a convenient and rapid way to submit assignments, potentially at the expense of the overall quality of the work. This issue remains an important consideration when assessing the broader implications of our study.

The main drawback of this study is the limited sample size (nine per group), which does not permit generalization of the findings or a more comprehensive statistical approach. Another limitation is language specificity (students wrote in their native, non-English language for convenience), which prevented the full application of AI detection tools. We should also consider that ChatGPT is predominantly trained on English content, so we cannot exclude the possibility that writing in English would have produced higher-quality information. Lastly, this was our students’ first interaction with ChatGPT, so it is possible that a lack of experience, as well as inadequate training in using AI language models, also affected their performance. Therefore, it is crucial to exercise caution when generalizing these findings, as they may not reflect the experiences of a broader range of ChatGPT users, who often report rapid draft generation. Future studies should therefore expand the sample size, the number and conditions of experiments, include students of different profiles, and extend the number of variables that could relate to writing skills in general. It would also be useful to conduct a study analyzing the quality and depth of the students’ prompts to ChatGPT, as the question type and the feedback provided by the user seem to markedly affect the final result (Farrokhnia et al., 2023).

However, the concern of academia and the media about this tool might be unjustified, as, in our example, ChatGPT was found to perform similarly to any web-based search: the more you know, the more you will find. In some ways, instead of providing structure and facilitating writing, it could distract students and make them underperform.