Background

Dictionary-making: harmless but protracted drudgery

The task of writing dictionary entries—famously termed harmless drudgery (Johnson, 1755)—is a highly time-consuming and labour-intensive process. Major dictionary projects, such as those for large general academic dictionaries, can take decades rather than just a few years to complete, and require the involvement of substantial teams of people. Well-known examples include:

  • The Oxford English Dictionary, whose first edition took over 70 years (1857–1928) to complete (Mugglestone, 2008);

  • Das Deutsche Wörterbuch, edited between 1838 and 1961, thus for more than 120 years (Haß-Zumkehr, 2011);

  • The Woordenboek der Nederlandsche Taal (WNT), first begun in 1864 and completed—no less than 134 years later—in 1998 (De Schryver, 2005; Van Sterkenburg, 1984).

These staggering timelines—often exceeding the human lifespan—stem from the fact that the production of dictionary entries is a meticulous and detail-oriented task that requires a great deal of time, effort, and expertise that comes with experience and training. The process involves, in broad outline, the identification, description, and illustration of lexical units of a language, their usage and meanings, as well as the classification and organization of this information into a coherent and accessible format. The sheer scale of the task and the complexity of language make it a daunting and challenging undertaking, requiring a significant investment of resources and human labour.

Writing high-quality dictionary entries has generally been seen as a rare and specialized skill that requires a unique combination of personal qualities and specialized training. Beyond skills specific to lexicography, lexicographers should possess a broad knowledge of the language and meticulous attention to detail. They must have an acute sense of language structure, as well as the ability to categorize and organize information in a way that makes it easily accessible to others.

Artificial Intelligence as an end to the drudgery?

In light of these challenges, any efforts aimed at automating the work of dictionary compilation and reducing the time and resources required are relevant. The path towards replacing human labour with automatic processes is aptly captured in Rundell (2023) and leads inevitably to the use of Artificial Intelligence, which is also the focus of the recent De Schryver and Joffe (2023).

The question I wish to address in this contribution is whether Large Language Model-based tools, such as GPT, could be used to significantly streamline the process of writing dictionary entries. Perhaps more importantly, can this be done while still ensuring sufficiently high levels of accuracy and quality? Towards the end of the previous millennium, Grefenstette (1998) famously asked: “Will there be lexicographers in the year 3000?”. Suddenly, the year 3000 seems closer than we thought. Before I proceed to assessing the suitability of ChatGPT as a COBUILD lexicographer, let me briefly explain what made this lexicography project stand out in the crowd.

What is special about COBUILD

The COBUILD project is widely seen as a breakthrough in modern learner (and not just learner) lexicography, as it has introduced a number of innovations (Sinclair, 1987). First, it relied on a corpus of digital texts as primary empirical evidence for the meaning and use of words in a way that no dictionary had done before. Second, it subscribed to a view of language that unified meaning and structure, as opposed to the then mainstream separation of lexis and grammar. This implied a view of dictionary senses based on the association of meaning with specific usage patterns. Third, as a consequence of this view, for a substantial proportion of entries, COBUILD adopted a defining format known as the Full-Sentence Definition (henceforth: FSD). The COBUILD FSD format typically included two clauses, the first of which illustrated the usage pattern of the sense, while the second clause offered a paraphrase of meaning (Barnbrook, 2002; Hanks, 1987). Let us illustrate this format (2) by comparing it with a traditional dictionary definition (1) of a sense of the verb approve.

  (1)

    to think that someone or something is good, right, or suitable

  (2)

    If you approve of someone or something, you like and admire them.

A traditional or classical definition as in (1) is designed to be, as far as possible, substitutable for the word defined. In terms of meaning, this says that a definition should be a faithful paraphrase of the sense defined. In terms of structure, the principle means that a noun headword should be defined by a noun phrase, a verb by a verb phrase, etc., so that—in principle at least, if not always in practice—the definition can replace the headword in running text without making it blatantly ungrammatical. One of the ways in which the COBUILD project was revolutionary was that it broke away from the substitutability orthodoxy, and instead proposed to rely to a substantial degree on a more conversational format inspired by how non-lexicographers might explain the meaning of words to others, such as teachers to students (for further discussion of defining styles, see e.g. Fabiszewski-Jaworski, 2012; Hanks, 2005).

When it comes to illustrative examples in entries, COBUILD may also be described as innovative, in that it adopted (at least to some extent), a corpus-driven approach, which places great value on the authenticity of illustrative material. Before the advent of text corpora, lexicographers routinely relied on invented examples to illustrate the meanings and uses of words. However, in time it became increasingly apparent that invented examples would often be artificial and would violate norms of native language use (Hanks, 2009). To address this, lexicographers sought to use authentic examples of language actually used by native speakers, which were becoming gradually more available in large text corpora.

The benefits of using text corpora notwithstanding, the sentences found there often presented challenges for language learners. These sentences would abound in unclear references to co-text, proper names, and rare words that could be confusing or distracting to learners. As a compromise between completely authentic and invented examples, modified corpus sentences were another option (for a discussion, see e.g. Humblé, 2001). Around 2008 (Kilgarriff et al., 2008), in response to the growing size of corpora, which offered an ever-larger pool of textual material that could potentially be used as examples, the GDEX (=Good Dictionary Examples) approach originated; its aim was to identify the ‘best’ examples by sorting concordance lines based on a constellation of adjustable parameters.

The emergence of Large Language Models may offer yet another alternative for the provision of examples in lexicography. The probabilistic nature of these models might be expected to result in natural, authentic-sounding word choices.

Study

Aim

The aim of this study was to see to what extent a Large Language Model (LLM) would successfully emulate the performance of a human lexicographer in terms of compiling COBUILD-like entries for English language learners. Specifically, the focus was on assessing the quality of AI-generated sense definitions and example sentences in comparison to those crafted by human lexicographers for the same headwords. An additional aim was to test the utility of expert lexicographer feedback for fine-tuning LLM performance.

Material

Selection of target headwords

The class of lemmas selected for this study was verbs of communication, as categorized in the English Vocabulary Profile (2023) (Capel, 2015). Verbs in general exhibit the richest complementation of all syntactic categories, and verbs of communication tend to have non-trivial syntactic patterns and pragmatic uses that may be a challenge in lexicographic description. Further, the set of candidate target headwords was restricted to those that do not have senses at CEFR levels lower than B1 (again, as described in the English Vocabulary Profile), and do not have major senses unrelated to communication. This narrowed down the choice to 23 items, of which 15 were selected for the study.

AI-generated entries

These were generated in a single interactive session on February 14th, 2023, via the web interface to OpenAI ChatGPT Plus. This subscription-based service, at the time of writing charging 20 US Dollars per month, promises uninterrupted service and faster text generation, as well as earlier access to new features. The main motivation behind the decision to subscribe was to avoid service unavailability due to overload, which at the time happened quite often. One unexpected consequence of switching to the premium subscription service was that the Plus service apparently (by their own claim) had no access to chat history before the switch. That is not necessarily bad for the present design, as it ensures starting with a clean slate, as it were.

The ChatGPT Plus web interface, with the GPT-3.5 model selected, was primed in a process known in the field as few-shot learning (Brown et al., 2020): the examples were supplied exclusively via the general interactive text interface, without any fine-tuning of the underlying parameters or hyperparameters. In this case, ChatGPT was presented with two example entries from the current version of COBUILD Online [https://www.collinsdictionary.com/dictionary/english]. The two entries presented as examples were APPROVE and ASSASSINATE. Note that the second example word is not a verb of communication. It was included as a good example of a passive pattern being the dominant one. The presentation of the two examples was preceded by a two-paragraph prompt setting out the purpose of the exercise as follows:

The idea is to have clear dictionary entries for English verbs, designed for advanced learners of English (though not native speakers) that at the same time explain meaning and illustrate usage. The novelty is that the way the words are used is conveyed not just through example sentences (usually three per sense), but already in the structure of the definition, which is written as a two-clause sentence, and the verb defined is embedded in a typical context. All this is supported with full-sentence examples. The definitions mostly use fairly simple words.

I will now illustrate these principles with an example entry for the verb ‘approve’.

The two examples for APPROVE and ASSASSINATE followed next:

approve [VERB]

  1.

    If you approve of someone or something, you have a favourable opinion of them or it.

    I approve of his plans for the future.

    She approves of her children’s friends.

    They approve of the new law.

  2.

    If you approve a plan, proposal, or idea, you give your official permission or endorsement to it.

    The board of directors approved the proposal.

    The government approved the construction of a new highway.

    The school approved the students’ request for a field trip.

  3.

    If you approve a payment, loan, or expense, you authorize it to be made.

    The bank approved the loan for the new business.

    The manager approved the expense for the company trip.

    The committee approved the payment for the consultant.

When introducing the latter entry, attention was drawn to the passive construction, and the COBUILD convention for the use of the generic ‘you’ in definitions was mentioned:

I will now present another example, and then ask you to produce similar entries for further verbs. Please note in this next example how the definition also uses the passive construction because the verb defined (“assassinate”) is predominantly used in the passive. At the same time, it avoids using the generic “you”, so as not to imply that the reader is guilty of socially unacceptable (here, indeed, criminal) acts.

assassinate [VERB]

1. When someone is assassinated, they are killed deliberately, especially for political or religious reasons.

The president was assassinated by a gunman.

The political leader was assassinated during the revolution.

The ambassador was assassinated in a foreign country.

2. If you assassinate someone’s character or reputation, you damage or destroy it by making false or malicious statements.

The politician’s opponent tried to assassinate his character.

The rumours assassinated her reputation.

The newspaper article assassinated the company’s image.

This last example entry was followed up with further instruction as below:

Do you see how sense one has been defined using the passive? Also, this entry has only two senses. There may be two, three, four senses, as many as you believe are reasonable for a proficient learner of English. You do not need to treat very technical or specialized uses that are restricted to particular professions or rare text genres.

Entries from COBUILD

The COBUILD versions of the entries, as well as the example entries used in the ChatGPT prompts, were drawn from the current version of Collins COBUILD Advanced Online (2023) presented on the Collins English Dictionary website [https://www.collinsdictionary.com/dictionary/english], usually as the first dictionary entry under the headword, also accessed on February 14th, 2023. A typical COBUILD entry starts with the headword (or lemma), then includes information on pronunciation, inflected forms, and syntax codes that convey complementation patterns of the headword. COBUILD syntax codes as well as listings of inflected forms were all left out from the test entries. However, I established in an earlier session with ChatGPT 3.5 that the chatbot did not have any problems generating these two additional microstructural components. Informal testing also indicated that ChatGPT is able to produce good-quality phonemic and phonetic transcriptions, but I did not investigate this aspect in a systematic fashion here.

Quality evaluation

To evaluate the quality of the entries (definitions and example sentences) produced by ChatGPT, human experts were enlisted to assess these entries alongside the original COBUILD entries (see above) for the same headwords. The evaluation process was conducted via programmatically generated Microsoft Excel workbooks. Experts assigned ratings from dropdown menus in the Excel spreadsheet, using a closed set of pre-defined scale labels. For each entry, experts assigned a rating for the sense definition, the sense exemplification (typically comprising three example sentences), and the entry as a whole. In each case, a five-point scale was used with the following verbal labels: Bad, Wanting, Passable, Good, Great. The sequence of entries was randomized separately for every expert, with no indication as to whether a particular entry was AI-generated or produced by a human lexicographer, nor were experts informed of the origin of the entries prior to the task. In addition, there was space for optional open-ended comments on definitions, examples, and entries. The Excel files were shared and returned via email.
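By way of illustration, a minimal R sketch of how such a rating workbook might be generated is given below; the openxlsx package, the column layout, and the data objects (entries, rater_id) are assumptions made for the example, not a description of the actual script used.

```r
# Sketch: one randomized rating workbook per expert (assumed layout and objects).
library(openxlsx)

scale_labels <- c("Bad", "Wanting", "Passable", "Good", "Great")

make_workbook <- function(entries, rater_id, seed) {
  set.seed(seed)  # a separate random entry order for every expert
  sheet_data <- entries[sample(nrow(entries)), c("headword", "entry_text")]
  sheet_data$definition_rating <- ""  # to be filled in from dropdown menus
  sheet_data$examples_rating   <- ""
  sheet_data$entry_rating      <- ""
  sheet_data$comments          <- ""  # optional open-ended comments

  wb <- createWorkbook()
  addWorksheet(wb, "ratings")
  addWorksheet(wb, "scale", visible = FALSE)  # hidden sheet holding the scale labels
  writeData(wb, "scale", scale_labels)
  writeData(wb, "ratings", sheet_data)

  # Dropdowns restricted to the closed set of five scale labels
  dataValidation(wb, "ratings",
                 cols = 3:5, rows = 2:(nrow(sheet_data) + 1),
                 type = "list", value = "'scale'!$A$1:$A$5")

  saveWorkbook(wb, paste0("ratings_", rater_id, ".xlsx"), overwrite = TRUE)
}
```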

Four invited experts agreed to participate and complete the evaluation, and they all agreed to disclose their identities. Two of the experts had been part of the original COBUILD team themselves (Michael Rundell and Elizabeth Potter), while the other two had not, but had done original research and published in the area of the format and language of entries for learners of English (Sylwia Wojciechowska and Reinhard Heuberger). All experts returned their ratings promptly and were then debriefed about the purpose of the study and the origin of the entries. They also received ad hoc summary statistics of how they had rated AI-generated versus human-written definitions, examples, and entries. At that time, they also received their evaluation sheets, now with information on the creator of each entry revealed.

Data analysis

The data were analyzed quantitatively and qualitatively using the R environment for statistical computing (R Core Team, 2022). For some analyses, expert ratings were converted to numerical values from 1 to 5, which were then used in mixed-effects models. I also computed mean ratings and a numerical measure of human advantage, defined as the difference between the ratings given to COBUILD entry components and to AI-generated entry components for the same headword. Qualitative assessment of the free-text comments from experts was also assisted with R tools. All data are available as Supplementary Online Material.
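As an illustration, the core of this computation can be sketched in a few lines of R. The column names (Rating, Creator, Element, Word, Rater) follow the model formula reported below, while the ratings data frame, the Creator labels, and the dplyr/tidyr phrasing are assumptions made for the sketch.

```r
# Sketch: convert verbal labels to 1-5 and compute 'human advantage'
# (COBUILD rating minus AI rating for the same expert, headword, and element).
library(dplyr)
library(tidyr)

scale_labels <- c("Bad", "Wanting", "Passable", "Good", "Great")

ratings_num <- ratings %>%
  mutate(Rating = as.integer(factor(Rating, levels = scale_labels)))

human_advantage <- ratings_num %>%
  pivot_wider(id_cols = c(Word, Rater, Element),
              names_from = Creator, values_from = Rating) %>%
  mutate(advantage = COBUILD - AI)   # positive values favour human lexicographers

human_advantage %>%
  group_by(Word, Element) %>%
  summarise(mean_advantage = mean(advantage), .groups = "drop")
```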

I will first present the results of the evaluation, starting with the ratings, then the open-ended comments. Finally, I will report the outcome of the second round of training, whereby critical insight from the experts was fed back to ChatGPT to see if this might improve performance by eliminating some of the problems noted.

Results and analysis

The results will be presented in two parts: the first part will focus on expert ratings of the entries and will be quantitative in nature, while the second part will look at the content of the open-text comments, specifically focusing on the drawbacks of the AI-generated content, as reported by the experts.

There are several sources of variability behind the ratings. Obviously, there is our central design factor, the Creator of the entry: human lexicographers from the COBUILD team versus the not-so-human ChatGPT. Then, each expert has their own criteria and preferences, as it were a private concept of the ideal definition, example, and entry: there will thus be variability due to Expert. Further, some words are inherently harder to treat lexicographically, and the lexicographers, be they human or otherwise, may do a better or worse job in a particular case. This is variability due to Word. Finally, there will always be some residual variability that cannot be explained away using the above factors.

Ratings of AI-generated and human-written COBUILD entries: observed values

This section will present descriptive values of ratings from the four expert raters assessing the quality of definitions, examples, and entries.

To begin with, Table 1 gives the overall central measures (variability will be treated in more detail below) for observed (i.e., raw reported) expert ratings of definitions, examples, and entries overall, created (or, in the case of COBUILD corpus-based examples, perhaps curated) by either human COBUILD lexicographers or ChatGPT. Mean values are calculated by converting the five verbal labels to integers from 1 to 5, respectively. The median value is the middle Rating for a particular combination of Element and Creator, whereas the Mode is the most frequent Rating for a particular combination. As can be seen, the Median and Modal values happen to be the same in all instances.

Table 1 Central summary measures for expert ratings of definitions, examples, and entries created by AI and COBUILD lexicographers.
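A sketch of how such central measures might be derived is shown below; since base R has no built-in mode function, a small helper is defined. The data frame ratings_num and its columns are assumed, as in the sketch above.

```r
# Sketch: mean, median, and modal rating per Element x Creator (cf. Table 1).
library(dplyr)

stat_mode <- function(x) {            # most frequent value (simple helper)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

ratings_num %>%
  group_by(Element, Creator) %>%
  summarise(Mean   = mean(Rating),
            Median = median(Rating),
            Mode   = stat_mode(Rating),
            .groups = "drop")
```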

This first rough look, averaging over all words and experts, suggests that AI-generated definitions were rated similarly to original COBUILD definitions, both earning an average rating of Good, and with only a small difference (a third of a grade) in favour of human creators. However, AI-generated examples received average ratings nearly a grade lower than those created by human lexicographers. In terms of verbal labels, COBUILD examples were Good on average, while those from ChatGPT were only Passable. For entries overall, the ‘human advantage’ amounted to about half a grade by mean value. This makes good sense given that entries comprised definitions and examples, so we would expect experts to give entries ratings that were a compromise between those for definitions and examples, plus perhaps additional elements such as sense division and entry organization.

Following this broad first look, let us bring into the picture the four individual experts and fifteen entry headwords. In addition, for each data point, let us calculate the difference between COBUILD and AI-created elements, extracting what may be loosely yet vividly termed human advantage. Figure 1 plots this for each word and each expert separately.

Fig. 1: Differences between ratings of COBUILD and AI-created content for the same expert, headword, and entry element.

The individual differences represent the advantage of human lexicographers over AI.

Figure 1 confirms the initial impression that definition is where the human advantage is smallest, but it adds to the picture an indication of the degree of variability by word, and by expert. It is clear from the plot that experts do not always agree, and sometimes disagree quite dramatically in their ratings, as for the entry for the word insist. At the same time, it seems that the human advantage does not hold for at least some of the definitions. To make this somewhat clearer, consider a different view in Fig. 2, where the individual data points for experts have been replaced with a dot at mean rating difference (across four experts) plus the range of rating differences. Recall that these rating differences represent human advantage as assessed by (human) experts. In addition, the words have been re-arranged by the human advantage of definition.

Fig. 2: Mean rating differences (human advantage) across experts.

Ranges are shown for individual headwords and the three entry elements, arranged by mean human advantage for definition.

Looking at the left-most panel, we can draw the conclusion that for six out of the fifteen headwords, the sense definitions proposed by ChatGPT are at least as good as those written by human COBUILD lexicographers, by an average judgement of four experts. Further, for all words except one (confirm), there was at least one expert who did not note a human advantage in the definition.

The results are rather more in favour of human COBUILD lexicographers when it comes to entries overall, and even more so for example sentences. In the next section, I will pursue the best hierarchical model that fits the data and that appropriately controls for variation due to Word and Expert, entering these as random factors in the model.

A hierarchical model of how human- and AI-created entries are rated

I started by checking model assumptions, then proceeded from the maximal model with a complete random structure and worked down to the optimal model, that is, the simplest model not significantly worse than any of the more complex ones (Zuur et al., 2009). Sparing the reader the model selection sequence (available on request), BIC, AIC, and AICc all pointed to the same optimal random structure. The final model formula, in lmer syntax (Bates et al., 2015), was as follows:

$$\mathrm{Rating} \sim \mathrm{Element} * \mathrm{Creator} + (1 + \mathrm{Creator} \mid \mathrm{Rater}) + (1 + \mathrm{Creator} \mid \mathrm{Word}),$$

where Rater marks the identity of the rating expert. The model formula means that I allow ratings to depend on the Creator (AI versus COBUILD), the Element rated (definition, examples, or entry overall), as well as an interaction between the two (allowing, for example, for ChatGPT churning out good definitions but poor examples). In addition, I allow some experts to be more lenient than others, and some headwords to be inherently easier to treat lexicographically (random intercepts). Further, the model also allows some experts to be systematically more critical of AI-generated content than of that produced by humans, or vice versa (recall, however, that they did not know which content was created by whom until debriefing after the exercise). Finally, some words may be more difficult to describe for AI but not for human lexicographers, or vice versa (random slopes).
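In lme4 code, the selected model corresponds to a call along the following lines; only the formula itself is taken from the text, while the data frame and the comparison of candidate models are assumptions for the sketch.

```r
# Sketch: fitting the selected hierarchical model with lme4 (Bates et al., 2015).
library(lme4)

fit <- lmer(Rating ~ Element * Creator +   # fixed effects and their interaction
              (1 + Creator | Rater) +      # per-expert intercepts and AI/COBUILD slopes
              (1 + Creator | Word),        # per-headword intercepts and slopes
            data = ratings_num,            # numeric ratings, as in the sketch above
            REML = TRUE)                   # final model fit by REML, as reported

summary(fit)
# Competing random structures can be compared with, e.g., AIC(m1, m2) or BIC(m1, m2).
```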

The final model was fit by REML and produced a fit with random and fixed effects as shown in Fig. 3 below.

Fig. 3: Random and fixed effects of the final selected hierarchical model.

The model formula is: \(\mathrm{Rating} \sim \mathrm{Element} * \mathrm{Creator} + (1 + \mathrm{Creator} \mid \mathrm{Rater}) + (1 + \mathrm{Creator} \mid \mathrm{Word})\).

The magnitudes and t values of the fixed-effect estimates suggest that ratings for AI-generated definitions were not, in general, significantly different from their COBUILD counterparts. However, the gap between AI and COBUILD for examples, as well as for the entry overall, was significantly bigger than for definitions. A plot of predicted ratings with 95% confidence intervals for the fixed effects of the selected model is given in Fig. 4, whereas Fig. 5 shows predicted ratings with 95% prediction intervals that include the random factor of Rater (=expert).

Fig. 4: Predicted ratings by Creator and Element.

The figure includes 95% Confidence Intervals.

Fig. 5: Predicted ratings by Creator and Element for each expert Rater.

The figure includes 95% Prediction Intervals.
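Predictions of this kind can be obtained, for instance, with the ggeffects package; the sketch below, building on the lme4 fit from the earlier sketch, shows one possible way of producing the values plotted in Figs. 4 and 5, not necessarily the code used for the published figures.

```r
# Sketch: predicted ratings by Creator and Element (cf. Figs. 4 and 5).
library(ggeffects)

# Fixed-effects predictions with 95% confidence intervals (as in Fig. 4)
pred_fixed <- ggpredict(fit, terms = c("Element", "Creator"))
plot(pred_fixed)

# Predictions conditioned on the random factor Rater, with prediction intervals (as in Fig. 5)
pred_by_rater <- ggpredict(fit, terms = c("Element", "Creator", "Rater"), type = "random")
plot(pred_by_rater)
```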

Free comments

Apart from the Likert-scale ratings, experts also had space for optional comments on the entries. Overall, there were 135 comments from the four experts, which corresponds to about a third of the 400 combinations of sense and expert. The more critical experts tended to offer more comments, with a maximum of 43 comments and a minimum of 21 comments. A majority of the comments (82, or 61 percent of the total) were on AI-generated entries, a pattern that held for three of the four experts. The one expert who gave relatively lower ratings to COBUILD entries also had more critical comments on those entries. In general, comments were offered to point to problems rather than give praise, so highly rated entries and senses usually did not receive comments.

AI-generated entries were most often criticized for their redundancy/repetitiveness: this was a repeated comment from one expert (‘Redundancy—examples repeat exactly the same pattern’). Examples were also described as ‘unnatural’; less commonly as ‘clunky’, ‘dull’, ‘unconvincing’, and ‘simple’ (‘last example is unconvincing, looks unnatural’; ‘examples are dull and samey’; ‘clunky past-tense examples’). Some definitions were described as ‘wordy’ (‘definition is on the right lines but more wordy than it needs to be’). Sometimes the negative word was accompanied by a positive adjective (‘examples clear but clunky’; ‘nice clear defs, clear but dull examples’). One specific point that recurred in comments from two experts was the unvaried use of the past simple tense in examples (‘only past forms of the verb in the examples’). Experts also occasionally suggested the splitting or merging of specific senses. Figure 6 shows a word cloud of the lexical descriptors found most commonly in expert comments, after filtering out irrelevant stop words (function words, references to microstructural elements, sense numbers, etc.).

Fig. 6: Word cloud from expert comments on AI-generated entries.

Font size reflects comment frequency.
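A word cloud of this kind can be produced with standard R text-mining tools; the sketch below, using the tm and wordcloud packages, an assumed comments data frame, and an illustrative custom stop list, shows the general approach rather than the exact code used.

```r
# Sketch: word cloud of descriptors in expert comments on AI-generated entries.
library(tm)
library(wordcloud)

comments_ai <- subset(comments, Creator == "AI")$Comment   # assumed data frame

corpus <- VCorpus(VectorSource(comments_ai))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
# Project-specific stop list: microstructural terms, sense numbers, etc. (illustrative)
corpus <- tm_map(corpus, removeWords, c("definition", "def", "example", "examples",
                                        "sense", "entry", "one", "two", "three"))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(words = names(freq), freq = freq,
          min.freq = 2, random.order = FALSE)   # font size reflects frequency
```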

Entries from COBUILD also received comments, likewise mostly critical (Fig. 7). Experts had issues with repetitious and wordy definitions (‘def is wordy, examples don’t fit def’). Only one use of ‘clear’ was in an unqualified positive comment (‘nice clear def, good examples’); elsewhere the adjective was either negated (‘not at all clear’; ‘rather rambling definition, not at all clear’) or used to soften a negative statement, usually ‘long’ or ‘wordy’ (‘wordy but clear’). Similarly, only one of four uses of ‘good’ was unreservedly positive; in other cases, it was mixed (‘1 and 2 are good, 3 is not’; ‘Most of the examples are good; the first one looks less likely’). ‘Helpful’ always went with ‘wordy but …’.

Fig. 7: Word cloud from expert comments on COBUILD entries.

Font size reflects comment frequency.

An attempt at fine-tuning the examples

As seen in the previous sections, it was the AI-generated examples that were found least satisfactory compared to their COBUILD counterparts, both in terms of quality ratings and free-text comments. It is conceivable that this is at least partially due to sub-optimal training of ChatGPT, during which only very limited model data was presented to the chatbot to generalize from.

In order to see if follow-up training that takes advantage of the expert feedback received could help mitigate the problems identified, I fed a summary of the feedback back to ChatGPT, this time using the newer GPT-4 model (May 12th version). I asked it to revise the entry for encourage, which had received very low expert ratings for its examples. For this purpose, I prompted the system with the following passage:

In another thread, GPT-3.5 generated dictionary entries for English verbs for learners of English modelled on the Cobuild dictionary. These entries have now been evaluated by several experts in lexicography. They liked the definitions but thought the example sentences to illustrate the use of the headword could be improved. Please revise the following entry for the verb ENCOURAGE as described below. The entry consists of two senses, each with a definition and three examples of use:

(1) do not change the definitions: keep them exactly as they are;

(2) make example sentences within each sense less similar to one another: do not just use the past simple tense, do not always start with “the”; one or two of the three examples could be a bit more elaborate than the barebones subject - verb - complement;

(3) do remember that each example should still illustrate the meaning and syntactic pattern of the verb ENCOURAGE as given in the definition for this sense. Taking these comments into account, please revise your last entry for the word ENCOURAGE, which was:

encourage [VERB]

  1.

    If you encourage someone to do something, you give them support, confidence, or motivation to do it.

  • The coach encouraged the team to do their best.

  • The teacher encouraged the student to study harder.

  • The parent encouraged the child to try again.

  2.

    If something encourages you, it gives you support, confidence, or motivation to continue or to do something.

  • The good grade encouraged the student.

  • The sunny weather encouraged the hikers.

  • The success encouraged the entrepreneur.

Following this prompt, the system returned the following entry as the response, on the first try, with no extra introductions, explanations, or comments:

encourage [VERB]

  1.

    If you encourage someone to do something, you give them support, confidence, or motivation to do it.

  • In a gentle voice, Mary encouraged her shy little brother to speak up in class.

  • Despite the initial failure, Rob’s friends encouraged him to try starting his own business again.

  • “Never stop believing in yourself,” the coach said, encouraging the team before the big match.

  2.

    If something encourages you, it gives you support, confidence, or motivation to continue or to do something.

  • Finding her first novel on the bestseller list greatly encouraged the young author to continue writing.

  • With the sun shining brightly, the hikers felt encouraged and decided to take the more challenging trail.

  • His early success as an entrepreneur encouraged him, instilling the belief that he could turn any idea into a successful business.

These revised example sentences appear to move a long way towards the expectations of the experts. They are now more elaborate and varied, rely less on boilerplate structures, and exhibit more variation in grammatical tense, although the simple past still dominates.

The revised entry was also sent to the same experts alongside the originally generated entry, asking for their opinion on the quality of the revision. All four experts responded, and all characterized the revision as a significant improvement on the originally generated entry.

Discussion, conclusion, and limitations

Our results indicate that ChatGPT is capable of producing dictionary definitions emulating COBUILD style that are practically indistinguishable in quality from those written by highly trained human lexicographers. This is a ground-breaking finding, and one that challenges the centuries-long lexicographic orthodoxy that sees definition-writing as a rare elite skill that requires extensive training (and perhaps also unique inborn qualities). Obviously, such generative systems are much faster and cheaper to use than humans (assuming the training is not done merely to produce a single entry but a larger number), even if humans were still to be used for quality-checking or post-editing.

The quality of the examples generated by ChatGPT turned out to be less impressive: they were rated significantly lower than those crafted by professional human lexicographers. However, the fine-tuning follow-up session suggests that this may be a matter of improving the instructions passed to the model (prompt engineering). With some fine-tuning using insights from expert feedback, the system seemed capable of generating example sentences that are both authentic-sounding and accessible.

I have not explored in this contribution issues such as sense identification or other aspects of entry organization. These are aspects that need to be addressed systematically before we can claim that AI-generated entries are as good as those produced by professional human lexicographers. The ratings for entries overall collected in this study appear to be affected by the relatively poorer performance of ChatGPT with examples (before the fine-tuning session), but occasional suggestions addressing sense organization also appear in the free comments, mostly to either split or merge senses (on this topic, see e.g. Kilgarriff, 1997; Lew, 2022).

While we clearly need further work to gauge more fully the potential of AI and LLMs in lexicography (but see De Schryver & Joffe, 2023), including studies with human dictionary users, this study has shown that such systems can perform very well, even with minimal instruction, in emulating a specific defining style, and could thus potentially take over some of the most laborious aspects of monolingual lexicography. In light of this, the use of Large Language Models to generate examples for lexicography deserves further consideration and exploration.