Abstract
English speakers use probabilistic phrases such as likely to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey and, if communication is successful, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts, investment advice and medical advice. We then had GPT-4 (OpenAI), a Large Language Model, complete the same tasks as the human participants. We found that GPT-4’s estimates of probability in both the Investment and Medical Contexts were at least as close to those of the human participants as the human participants’ estimates were to one another. However, further analyses of residuals disclosed small but significant differences between human and GPT-4 performance. Human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.
Introduction
English has a rich vocabulary of probabilistic phrases used to communicate the relative frequency or likelihood of events (we will use the term probability as a catch-all for relative frequency, likelihood, degree of uncertainty, and probability; following O’Brien1 we use the term ambiguity as a synonym for imprecision). Your doctor might tell you that you will probably feel better after a night’s sleep. What probability or range of probabilities do you think he intends to convey? What if instead he used the probabilistic phrase possibly? In the latter case the doctor is likely signaling a lower probability, but did he intentionally choose a more ambiguous term? The doctor may have reasons for introducing ambiguity. Perhaps he is not sure what effect a good night’s sleep will have. Perhaps he wants to manage the patient’s expectations.
The coordination game. In using these phrases, you, your doctor, and other English speakers you encounter are playing a simple coordination game2,3,4,5,6,7 in which both players win to the extent that the probabilities and ambiguities the second player estimates are close to those the first player intended.
Human language comprises many different language games8,9,10 and the successful speaker must play them all well. Our goal here is to compare human and machine performance in one of many language coordination games. Performance in the game measures how well any given player, doctor or patient, uses language as a tool8,9,10. Success in playing the game would suggest that the two players in effect share an understanding of the language of probability and ambiguity that allows them to coordinate and work together well.
A coordination game is an objective task and we can compare human and machine performance trial by trial. Failure would pinpoint shortcomings in the performance of the machine. Can a computer play the game as well as humans do? Do successes and failures in playing the game mimic those of humans?
There is disagreement on how to assess the performance of Large Language Models such as GPT-411,12. We follow the recommendations of Burnell et al.13 in our assessment: (1) We use multiple related tasks (probability and ambiguity). (2) The tasks are objective with degree of failure measured quantitatively. (3) We report exactly one run of GPT-4—the first—in each experimental condition, not an aggregate of multiple runs. (4) We look at how GPT-4 performs in the two contexts, giving medical advice and giving investment advice.
Lastly, (5) the tasks we consider—communicating information about probability and ambiguity—are intrinsically important. There is a large literature concerning human error in decision making14, failures in probabilistic reasoning15, and the consequences of these errors and failures16,17.
The capabilities of GPT-4 and other candidate Artificial General Intelligences have been compared to those of humans, with mixed results. GPT-4 fails simple intelligence tests18. On the other hand, Webb, Holyoak, and Lu19 report that GPT-4’s ability to engage in analogical reasoning and abstract pattern induction is comparable to that of humans. Gurnee & Tegmark20 find that it can reason about spatial and temporal structure. GPT-4 can do more than chat: it can write simple computer code for applications specified in natural language21.
The probabilistic phrases we consider (Table 1) have been used in previous research with human participants1. We modify any probabilistic phrase as needed so that its use in context is grammatical (e.g. possible can become possibility). In Fig. 1 we illustrate one turn of the coordination game as a communication channel22,23. For simplicity we assume that, in the coordination game, the first player has only one probability and one ambiguity to signal and the second player is constrained to report a single estimate of probability and one of ambiguity.
In the full coordination game (Fig. 1), Player 1 is given a target probability (for example, 63%) and must encode it as one of the probabilistic phrases in Table 1. Perhaps she picks likely. This probabilistic phrase is transmitted to Player 2, who must decode it and estimate the target probability. In Fig. 1 she estimates 70%. The absolute difference (7%) between Player 1’s target probability, 63%, and Player 2’s estimate, 70%, is the error, a measure of the failure of coordination in the coordination game.
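A turn of this game can be sketched in code. The phrase-to-probability codebooks below are illustrative placeholders (not the estimates from Table 1 or from the experiment); each player may hold a slightly different codebook, which is what makes coordination nontrivial:

```python
# One turn of the coordination game: Player 1 encodes a target
# probability as a phrase; Player 2 decodes the phrase back to a
# probability. Codebook values are illustrative placeholders only.
CODEBOOK_P1 = {"unlikely": 20, "possible": 45, "likely": 70, "almost certain": 95}
CODEBOOK_P2 = {"unlikely": 25, "possible": 50, "likely": 70, "almost certain": 90}

def encode(target: float) -> str:
    """Player 1: choose the phrase closest to the target probability."""
    return min(CODEBOOK_P1, key=lambda phrase: abs(CODEBOOK_P1[phrase] - target))

def decode(phrase: str) -> float:
    """Player 2: estimate the probability the phrase conveys."""
    return CODEBOOK_P2[phrase]

def turn_error(target: float) -> float:
    """Coordination error: |target - Player 2's estimate|."""
    return abs(target - decode(encode(target)))

# Example paralleling the text: target 63% -> "likely" -> estimate 70% -> error 7%
```

Both players win to the extent that `turn_error` is small across targets, which requires the two codebooks to agree.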
We will focus on the second stage (DECODE) of the coordination game (outlined in red in Fig. 1), evaluating GPT-4’s performance as Player 2 and comparing it to that of human participants also playing as Player 2. That is, GPT-4 and its human counterparts will be asked to DECODE probabilistic phrases and estimate the corresponding probabilities. We will also ask GPT-4 and human players to rate the ambiguity (imprecision) of the 23 probabilistic phrases they decode on a scale of 0–100. All ambiguity estimates were made after all probability estimates, by both the human participants and GPT-4. Probabilistic phrases were presented in randomized order to both the human participants and GPT-4.
We emphasize that, from the viewpoint of Player 2 (the human participant or GPT-4) the game is the coordination game illustrated in Fig. 1 in which Player 2 believes he is trying to coordinate a choice of probability with Player 1. In actuality the role of Player 1 is played by the experimenter who provides Player 2 with probabilistic phrases in a predetermined randomized order but Player 2 does not know that. We collect and analyze data from only Player 2, the part of Fig. 1 marked with a dashed red contour. By having the researcher play the role of Player 1, we ensure comparability of input between human and GPT-4: Player 2 saw probabilistic phrases in an order determined by the experimenter alone, not the experimenter and an uncontrolled Player 1.
The essence of a 2-person coordination game2 is that the two players must each anticipate what the other player is thinking, and the anticipation is recursive: “I know that you know that I know ….”. Our game captures that essence. Player 2—human or GPT-4—is asked to see himself through the eyes of a doctor or financial advisor who is trying to communicate information about uncertainty to him. Can an LLM do this as well as a human?
There are previous studies whose participants were asked to assign explicit probabilities to probabilistic phrases1,24,25,26,27,28,29,30,31. See32,33 for reviews. These studies assess the extent to which humans—for the most part without any special training—agree with each other in their use of probabilistic phrases to signal probability. If all the speakers in a language community assign the same probabilities to probabilistic phrases, then the players would do very well at the coordination game.
Questions
1. Can GPT-4 play the coordination game as well as humans? Are there patterned deviations between human and machine? We needed a criterion to judge whether GPT-4, playing as Player 2, is doing what a human player would. We developed two criteria, the first based on linear model fits, the second on performance.
The first criterion is based on fitting a linear model to bivariate scatterplot data, as we explain below. The fitted parameters give us clues about shortcomings of GPT-4 even when GPT-4 overall plays the game well as measured by the second criterion. We do not claim that the linear model is an adequate model of the mapping from GPT-4’s estimates of probability or ambiguity to the corresponding human estimates; the fitted parameters serve only as summary statistics intended to aid in interpreting the data.
Second, we evaluate how well human and GPT-4 coordinate. We develop a measure—discordance—of the extent to which human players disagree with one another in playing the game and compare this measure to the discordance between GPT-4 and human players. Does GPT-4 perform as well in the coordination game as the median human player?
There is more to language competence than assigning probability estimates and ambiguity ratings, but systematic failure to coordinate with human participants in our game would weaken any claim that GPT-4’s abilities are human-like. We could not trust an Artificial General Intelligence to give medical advice if the probability phrases it uses were not correctly understood by human patients.
2. Is GPT-4 correctly sensitive to context? The meanings of words can depend on the context in which they occur. If your doctor and your financial consultant both use the phrase not certain, does it signal the same probability? Ambiguity? We will include two contexts in the experiment, medical and financial, and ask participants, including GPT-4, to rate the 23 probabilistic phrases in Table 1 for probability and for ambiguity in each context.
Humans distort small and large probabilities33. A doctor may well avoid probabilistic phrases near the extremes of the probability scale (e.g. “almost certain”) precisely because he knows his patients will distort them. Or he might bias his choice of words to counteract the expected bias of his patients. Similarly, a player who suspects the motives of a financial adviser might try to “debias” the estimates he receives.
Keep in mind that the issue is not whether GPT-4’s assessment of probability and ambiguity is invariant under context but whether GPT-4 exhibits the same changes or lack of change in probability and ambiguity ratings across context as do the human participants. In a coordination game it doesn’t matter whether you are right, only whether you agree with everyone else.
3. Is GPT-4 stable? Lastly, we briefly investigate the stability of GPT-4 in this game. If we rerun the estimates by GPT-4, do we get a series of similar estimates, or a series of similar estimates with the occasional highly discrepant estimate? We might hesitate to permit an Artificial General Intelligence to give medical advice if 1 time out of 100 it produced markedly discrepant estimates of probability or ambiguity. The motivation for testing stability will become clear when we examine the data.
Results
The results are presented in three numbered sections corresponding to the numbered questions above. We split the first question in two, one part (1A) concerned with probability, one with ambiguity (1B).
Human vs. GPT-4: probability
In Fig. 2 we plot the median probability ratings assigned to each of the 23 probabilistic phrases by the 25 human participants against the GPT-4 ratings of each of the 23 probabilistic phrases. Figure 2a shows results in the Investment Context while Fig. 2b shows results for the Medical Context. The ratings range from 0 to 100%. If the median human participant agreed with GPT-4 in rating probability the plotted points would fall on the dashed blue identity line. The letter codes correspond to the letter codes assigned to each probabilistic phrase in Table 1. An intercept significantly different from 0 or a slope significantly different from 1 would indicate a patterned discrepancy between GPT-4 and the median human participant. We test for both possibilities.
There is an evident outlier in Fig. 2a for the probability phrase “low risk” plotted in red. There is a similar outlier in Fig. 2b for the probabilistic phrase “not certain”. The outliers represent probabilistic phrases where GPT-4 and the median human participant assigned markedly different probabilities to the same probabilistic phrase. In the main text we report statistical analyses for this and later figures without these outliers. All results of hypothesis tests—with and without outliers—are included in a Supplement. We discuss outliers further in the section Stability.
We refer to tests with p-values less than 0.05 as “significant” for convenience in presenting the data. We report exact p-values for all tests in the main text and report exact p-values for all tests (including those with and without outliers removed) in the Supplement.
Intercept The Intercept estimate in Fig. 2a is 1.59, not significantly different from 0 [t (22) = 0.421, p = 0.339]. The Intercept estimate in Fig. 2b is 11.35, significantly different from 0 [t (22) = 3.458, p = 0.0011].
Slope The Slope estimate in Fig. 2a is 0.833, significantly different from 1 [t (22) = − 2.577, p = 0.009]. The Slope estimate in Fig. 2b is 0.825, also significantly different from 1 [t (22) = − 2.933, p = 0.0038].
Summary There are significant patterned differences between median human probability estimates and those of GPT-4. In both contexts median human estimates of probability are compressed by a factor of 0.8 relative to the estimates by GPT-4. In the Medical Context but not in the Investment, human estimates of probabilities are also offset vertically by roughly 10%. Human use of probability and relative frequency are typically distorted34,35 and the deviations we detect may be connected to probability distortion.
Discordance. Both the human participants and GPT-4 are engaged in a coordination game, and our second criterion for comparing human and machine is trial-by-trial winnings in the game. Did GPT-4 disagree with the human players more than they disagreed with one another?
We define a measure of the disagreement between the probability or ambiguity estimates of the human participants. Let pi = [pi1,…, pim] be the vector containing the m = 23 probability estimates of the ith of the n = 25 human participants, in the order of Table 1. We define the discordance of the ith human participant to be

$$D_{i} = \sum\limits_{j \ne i} {\left\| {p_{i} - p_{j} } \right\|^{2} } \quad \left( 1 \right)$$

where \(\left\| {p_{i} - p_{j} } \right\| = \sqrt {\sum\limits_{k = 1}^{m} {\left| {p_{ik} - p_{jk} } \right|^{2} } }\) denotes the Euclidean distance between pi and pj. The discordance of a participant is just the sum of the squared distances between the vector corresponding to the participant and each of the vectors corresponding to the remaining participants. It can be zero only if all the participants give identical estimates for all probability phrases.

Let pGPT denote the vector of probability estimates made by GPT-4 and define the discordance of GPT-4 to be

$$D_{GPT} = \frac{{n - 1}}{n}\sum\limits_{j = 1}^{n} {\left\| {p_{GPT} - p_{j} } \right\|^{2} } \quad \left( 2 \right)$$

There are n − 1 summands in Eq. (1) and n in Eq. (2). The multiplicative term (n − 1)/n in Eq. (2) corrects for the difference in the number of summands in the two equations.
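A minimal sketch of these discordance computations (toy data with 3 participants and 2 phrases for brevity; the experiment used 25 participants and 23 phrases):

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two estimate vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def discordance(i, vectors):
    """Discordance of participant i (Eq. 1): the sum of squared
    distances to every other participant's estimate vector."""
    return sum(sq_dist(vectors[i], vectors[j])
               for j in range(len(vectors)) if j != i)

def discordance_gpt(p_gpt, vectors):
    """Discordance of GPT-4 (Eq. 2): the sum of squared distances to
    all n human vectors, scaled by (n - 1)/n to match the n - 1
    summands of Eq. (1)."""
    n = len(vectors)
    return (n - 1) / n * sum(sq_dist(p_gpt, v) for v in vectors)

# Toy data: 3 participants x 2 phrases (illustrative values only).
humans = [[60, 20], [70, 30], [65, 25]]
d0 = discordance(0, humans)                # 200 + 50 = 250
d_gpt = discordance_gpt([66, 26], humans)  # (2/3) * (72 + 32 + 2)
```

The scaling makes a human participant's discordance and GPT-4's discordance directly comparable despite the differing number of summands.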
Were GPT-4’s judgments of probability more discrepant from those of the human participants than those of the human participants were from one another?
Figure 3a is a boxplot of all the discordance values for the Investment Context, one blue dot per human participant. The discordance values are plotted vertically, and the red horizontal line marks the median discordance in each context. The lower and upper edges of the box mark the 25th-percentile and the 75th-percentile, respectively. Figure 3b is the corresponding plot for the Medical Context. The red diamonds mark the discordances of GPT-4 in the two contexts. The discordance of GPT-4 is below the median of the discordances for the humans for both Contexts. GPT-4 agreed with the human participants at least as well as they agreed with one another.
Human vs. GPT-4: ambiguity
In Fig. 4 we plot the median ambiguity ratings of the 25 human participants against the GPT-4 ratings of each of the 23 probabilistic phrases. Figure 4A shows results in the Investment Context while Fig. 4B shows results for the Medical Context. The letter codes once again correspond to the letter codes assigned to each probabilistic phrase in Table 1.
Intercept The Intercept estimate in Fig. 4A is 4.17, not significantly different from 0 [t (23) = 0.729, p = 0.237]. The Intercept estimate in Fig. 4B is 5.82, not significantly different from 0 [t (23) = 1.286, p = 0.106].
Slope The Slope estimate in Fig. 4A is 0.530, significantly different from 1 [t (23) = − 4.83, p < 0.0001]. The Slope estimate in Fig. 4B is 0.710, also significantly different from 1 [t (23) = − 3.458, p = 0.001].
Summary There are significant patterned differences between median human ambiguity estimates and those of GPT-4. In both contexts median human estimates are compressed by a factor of 0.5 to 0.7 relative to the estimates by GPT-4.
We did not analyze discordance for ambiguity since there are evident large differences in estimation of ambiguity by GPT-4 and the median human participant.
Comparisons across context
We next evaluate the extent to which human judgments of probability and ambiguity are invariant across context. In Fig. 5A we plot the median probability ratings of the human participants in the Investment Context versus the median probability ratings for each of the 23 probabilistic phrases of a different group of human participants in the Medical Context. A similar plot for GPT-4 is shown in Fig. 5B. Figure 6A,B show corresponding plots for ambiguity.
Intercept The Intercept estimate in Fig. 5A is − 0.147, not significantly different from 0 [t (23) = − 0.041, p = 0.484]. The Intercept estimate in Fig. 5B is − 2.41, not significantly different from 0 [t (23) = − 0.608, p = 0.275].
Slope The Slope estimate in Fig. 5A is 0.907, not significantly different from 1 [t (23) = − 1.41, p = 0.0865]. The Slope estimate in Fig. 5B is 0.977, not significantly different from 1 [t (21) = − 0.341, p = 0.368].
Summary There are no significant patterned differences between median human probability estimates in the Investment Context and the Medical Context. There are no significant patterned differences between estimates by GPT-4 in the Investment Context and the Medical Context.
Intercept The Intercept estimate in Fig. 6A is 0.048, not significantly different from 0 [t (23) = 0.021, p = 0.492]. The Intercept estimate in Fig. 6B is 8.42, significantly different from 0 [t (23) = 2.214, p = 0.019].
Slope The Slope estimate in Fig. 6A is 0.819, significantly different from 1 [t (23) = − 3.46 p = 0.0010]. The Slope estimate in Fig. 6B is 0.941, not significantly different from 1 [t (23) = − 0.341, p = 0.207].
Summary There are significant patterned differences across the two contexts in estimates of ambiguity: human estimates (Fig. 6A) show a slope significantly different from 1, and estimates by GPT-4 (Fig. 6B) an intercept significantly different from 0.
Stability
The two outliers in GPT-4’s performance raise issues concerning the stability of GPT-4. We chose to examine the outlier in Fig. 2b (the probabilistic phrase not certain) to determine whether it reliably recurs (representing a large but reliable discrepancy between human and GPT-4 estimates) or whether it is evidence of instability. If it reliably recurs, then it is effectively a difference of opinion between human and machine as to the meaning of a particular probabilistic phrase. If not, it would suggest that GPT-4 is unstable.
GPT-4 included explanations for its responses. We tabulate these explanations for the outlier in Fig. 2A (low risk) in the Investment Context and the outlier in Fig. 2B (not certain) in the Medical Context. The reader may agree with GPT-4 or not, but GPT-4’s response acknowledges that the probabilistic phrases can be interpreted in more than one way and perhaps human and machine are simply in disagreement.
Will the outlier recur if we rerun the trial? The GPT-4 interface that we have access to limits the number of runs that we can carry out in a fixed period of time, precluding analyses that require large numbers of repetitions of trials. We redid the GPT-4 estimates in the Medical Context four times, plotting the estimates as four blue contours in Fig. 7. The original estimates are plotted in red with a red solid circle marking the outlier. In brief, we did not reproduce the anomalous outlier we initially encountered, nor did other outliers emerge for any of the other probabilistic phrases. The four new estimates of probability are in good agreement with those of the human participants and with one another, but not with the original GPT-4 estimate.
Discussion
In coordination games we share information to coordinate actions5,6. The specific coordination game we consider here concerns correct use of probabilistic phrases signaling probability and ambiguity. There were two versions of the game, one with probabilistic phrases used to give investment advice, one with these same phrases used to give medical advice. Half the human participants ran in the Investment Context, half in the Medical.
Conclusions: Estimates of Probability
- As measured by discordance, GPT-4 agreed with the human participants at least as well as the median human participant agreed with the other participants. Based on overall performance we cannot distinguish GPT-4 and human (Fig. 3).
- Examined in detail (linear model fits), we found significant patterned discrepancies between GPT-4’s estimates of probability and those of the human participants (Fig. 2). In both contexts, human estimates of probability tended to be compressed relative to those of GPT-4.
- Use of probabilistic phrases signaling probability transferred well across the two contexts we considered, for both GPT-4 and humans (Fig. 5). A doctor’s use of likely conveys the same information about probability as that of an investment consultant.
Conclusions: Estimates of Ambiguity
Human estimates of ambiguity were compressed relative to those of GPT-4 by roughly a factor of 2. However, unlike probability, there is no standard scale of ambiguity. We can only conclude that GPT-4 did not anticipate human use of the ambiguity scale, a failure to coordinate.
There is some indication that GPT-4 is unstable, producing occasional outliers. Further research is needed to evaluate this apparent instability (Table 2).
We focused on one coordination game and compared human and machine. Similar games could be based on color terms or sets of dimensional adjectives36, for example, the dimensional adjectives describing size: small, big, large, etc36,37. Gurnee & Tegmark20 look at representation of space and time.
But human use of probability phrases is a particularly rich source of possible coordination games that we could use to compare human and machine. We can challenge GPT-4 to play each game we develop, comparing human and machine as we did here.
When, for example, do humans use probability phrases and when do they use numerical probability? Dhami & Mandel38 in their review article argue that the choice between the use of numerical or verbal probabilities by senders is influenced by several factors. For example, Juanchich and Sirota39 find that, in the medical context, senders prefer to use numerical values when uncertainty is about very consequential events, as, for example, the serious side effects of a drug. Would GPT-4 have similar preferences?
Wallsten et al.40 found that most people preferred to receive information about the probability of a chance event in numerical form but preferred to transmit this information as a probabilistic phrase. Erev & Cohen41 referred to this pattern of preference as the Communication Mode Preference Paradox. Whatever justification we offer for transmitting a probabilistic phrase instead of a numerical probability would seem to apply to receiving it in the same form, an apparent paradox. Would GPT-4 exhibit the same paradox?
Senders' use of verbal probabilities has several effects other than conveying an estimate of uncertainty38. Honda42 found that the use of verbal probabilities, for example using positive rather than negative terms, can introduce bias into the decision making process. Verbal probabilities can also be used as a face-saving strategy by both receivers and senders43,44,45. Does GPT-4 exhibit similar biases?
Comparing GPT-4 to human in these coordination games provides a systematic way to assess what GPT-4 (or any other LLM) can and cannot do, where its performance matches, exceeds, or falls short of, human. There is more to language than a series of coordination games but such games provide a scaffolding allowing us to describe what GPT-4 does in a principled way.
One puzzle we are left with is the compressions of probability and ambiguity we found. Despite these failures, GPT-4’s overall performance measured by discordance is better than that of the median human participant. Yet a simple scaling of output would “fix” the compression problem and presumably improve its performance. The methods used to develop GPT-4 did not result in an LLM that included appropriate scalings. Why?
Methods
GPT-4
Typical inputs for GPT-4 are shown in Table 3 for both contexts and for estimates of probability and ratings of ambiguity. Input consisted of a single context phrase followed by a request for a rating of either the probability or ambiguity of a specified probabilistic phrase. The phrases were presented in randomized order. GPT-4 was constrained to respond with a single percentage (probability) or number (ambiguity) and a justification. The values permitted were multiples of 5% from 0 to 100% (0%, 5%, …, 100%) and multiples of 5 from 0 to 100 for ambiguity ratings. We analyzed only the first run of GPT-4 in Figs. 1, 2, 3, 4, 5, 6.
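The response constraint can be made concrete with a small validator. This is a hypothetical helper for illustration only; the paper does not describe any parsing code:

```python
def parse_rating(text: str) -> int:
    """Parse and validate a single rating: an integer multiple of 5
    in [0, 100], optionally followed by '%'. (Hypothetical helper;
    not part of the authors' reported pipeline.)"""
    value = int(text.strip().rstrip('%'))
    if not (0 <= value <= 100) or value % 5 != 0:
        raise ValueError(f"rating {value} is not a multiple of 5 in 0-100")
    return value

# parse_rating("70%") accepts; parse_rating("63") raises ValueError
```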
Human participants
Fifty participants drawn from the New York University SONA Subject Pool agreed to participate in the experiment. All completed the experiment. We report the demographics of the pool. Almost all the potential participants in the pool were between the ages of 18 and 35 (93.34%); only a small portion were over 35 (6.38%) or under 18 (0.27%). 30.9% of the potential participants were male at birth and 69.0% were female at birth.
The research protocol and methods were approved by the Institutional Review Board for the Faculty of Arts and Science at New York University (IRB-FY2023-7544). Informed consent was obtained from all subjects and/or their legal guardian(s). All methods were carried out in accordance with relevant guidelines and regulations.
Procedure
Each participant was taken to a laboratory room and seated in front of a computer screen. Each participant was assigned at random to one of two contexts, investment or medical, 25 participants in each. Stimuli were presented and responses recorded using Google forms.
Participants assigned to the medical context answered questions about probability phrases and the ambiguity of probability phrases related to medical advice (see Table 3). The questions were prefaced by a single sentence signaling context, as were those posed to GPT-4. Participants assigned to the investment context answered questions related to investment advice (see Table 3).

Questionnaires in both contexts consisted of the same two sections: probability and ambiguity. The first section asked participants to rate the probability of 23 words and phrases on a percentage scale from 0 to 100% (in 5% increments). The second section asked participants to rate the ambiguity of the same 23 phrases using a scale ranging from 0 to 100 in steps of 5. Questions within each section were randomly re-ordered for each participant.

Payment

Participants were paid $12/hour for the time spent in the experiment (typically 15 min or less).
Hypothesis tests
In each panel of Figs. 2, 4, 5, 6 we plotted corresponding data (e.g. estimates from human participants and corresponding estimates from GPT-4) as a scatterplot, allowing the reader to assess the relations between variables visually. We fit the data in each scatterplot by a univariate linear model. The null hypothesis for each test was that the points fell on the identity line with slope 1 and intercept 0 with added iid Gaussian error. We used hypothesis tests to detect deviations of slope from 1 or intercept from 0. We discuss any significant trend captured by the estimated slope and intercept values (compression or offset).
All tests were two-tailed with size 0.05. We report the t-statistic, the degrees of freedom and the exact p-value for each test and refer to outcomes with p-value less than 0.05 as “significant” in discussing the data. We classified two data points as outliers, and label them where they appear in the scatterplots by their probabilistic phrases in red. We report results with outliers excluded in the main text and report the analyses of all tests with outliers excluded and with outliers included in the Supplement.
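The slope and intercept tests described above can be sketched as a small hand-rolled ordinary least squares fit (shown for illustration; the authors' actual analysis software is not specified):

```python
import math
from statistics import mean

def fit_and_test(x, y):
    """Fit y = a + b*x by least squares and compute t-statistics for
    the null hypotheses a = 0 (intercept) and b = 1 (slope), with
    df = n - 2, mirroring the tests reported for the scatterplots."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                        # slope estimate
    a = ybar - b * xbar                  # intercept estimate
    df = n - 2
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / df
    se_b = math.sqrt(s2 / sxx)           # standard error of the slope
    se_a = math.sqrt(s2 * (1 / n + xbar ** 2 / sxx))  # SE of the intercept
    return a, b, a / se_a, (b - 1) / se_b, df
```

The returned t-statistics would then be referred to a t distribution with n − 2 degrees of freedom for a two-tailed test of size 0.05.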
Data availability
All data are available from the corresponding author.
References
O’Brien, B. J. Words or numbers? The evaluation of probability expressions in general practice. J. R. Coll. Gen. Pract. 39(320), 98–100 (1989).
Schelling, T. C. The Strategy of Conflict. Harvard University Press (1960).
Lewis, D. Convention. Blackwell (2002).
Franke, M. Game-theoretic pragmatics. Philos. Compass 8(3), 269–284 (2013).
Benz, A. et al. (eds) Language, Games, and Evolution (Springer, 2011).
Benz, A. et al. (eds) Game Theory and Pragmatics (Springer, 2014).
Benz, A. & Stevens, J. Game-theoretic approaches to pragmatics. Annu. Rev. Ling. 4, 173–191 (2018).
Wittgenstein, L. Philosophical Investigations. Translation of Philosophische Untersuchungen, G. E. Anscombe [translator]. New York: Macmillan (1953).
Austin, J. L. How to do Things with Words (Oxford University Press, 1955).
Grice, P. Studies in the Way of Words. Harvard University Press (1991).
Mitchell, M. How do we know how smart AI systems are? Science, 381(6654) (2023).
Mitchell, M. & Krakauer, D. C. The debate over understanding in AI’s Large Language Models. arXiv:2210.13966v3 [cs. LG] (2023).
Burnell, R. et al. Rethink reporting of evaluation results: Aggregate metrics and lack of access to results limit understanding. Science 380(6641), 136–138 (2023).
Kahneman, D. & Tversky, A. Prospect theory: An analysis of decision under risk. Econometrica 47(2), 263–292 (1979).
Tversky, A. & Kahneman, D. Belief in the law of small numbers. Psychol. Bull. 76(2), 105–110 (1971).
Gilovich, T. How We Know What Isn’t So. Free Press (1993).
Gilovich, D., Griffin, T. & Kahneman, D. Heuristics and Biases. Cambridge University Press (2002).
Biever, C. The easy intelligence tests that AI chatbots fail. Nature 619, 686–689 (2023).
Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. arXiv:2212.09196v2 (2023).
Gurnee, W. & Tegmark, M. Language models represent space and time. arXiv:2310.02207v1 (2023).
Poldrack, R. A., Lu, T. & Beguš, G. AI-assisted coding: Experiments with GPT-4. arXiv:2304.13187v1 [cs.AI] (2023).
Shannon, C. E. A mathematical theory of communication. Bell Syst. Techn. J., 27 (3), 379–423, 623–656 (1948).
Shannon, C. E. & Weaver, W. The Mathematical Theory of Communication. Illinois (1949).
Lichtenstein, S. & Newman, J. R. Empirical scaling of common verbal phrases associated with numerical probabilities. Psychon. Sci. 9, 563–564 (1967).
Beyth-Marom, R. How probable is probable? A numerical translation of verbal probability expressions. J. Forecast. 1, 257–269 (1982).
Bryant, G. D. & Norman, G. R. Expressions of probability: Words and numbers. N. Engl. J. Med. 302, 411 (1980).
Budescu, D. V. & Wallsten, T. S. Consistency in interpretation of probabilistic phrases. Organ. Behav. Hum. Decis. Process. 36, 391–405 (1985).
Kong, A., Barnett, G. O., Mosteller, F. & Youtz, C. How medical professionals evaluate expressions of probability. N. Engl. J. Med. 315, 740–744 (1986).
Mapes, R. E. A. Verbal and numerical estimates of probability in therapeutic contexts. Soc. Sci. Med. 13A, 277–282 (1979).
Sawant, R. & Sansgiry, S. Communicating risk of medication side-effects: Role of communication format on risk perception. Pharm. Pract. 16, 1174 (2018).
Mellers, B. A. et al. How generalizable is good judgment? A multi-task, multi-benchmark study. Judgm. Decis. Mak. 12, 369–381 (2017).
Mosteller, F. & Youtz, C. Quantifying probabilistic expressions. Stat. Sci. 5(1), 2–12 (1990).
Zhang, H. & Maloney, L. T. Ubiquitous log odds: A common representation of probability and frequency distortion in perception, action, and cognition. Front. Neurosci. 6, 1–14 (2012).
Tversky, A. & Kahneman, D. Advances in prospect theory: Cumulative representation of uncertainty. J. Risk Uncertain. 5(4), 297–323 (1992).
Faller, M. Dimensional adjectives and measure phrases in vector space semantics. In Faller, M., Kaufmann, S. & Pauly, M. (eds) Formalizing the Dynamics of Information (CSLI Publications, 1990).
Bierwisch, M. Some semantic universals of German adjectivals. Found. Lang. 3, 1–36 (1967).
Maloney, L. T. & Gelman, S. A. Measuring the influence of context: The interpretation of dimensional adjectives. Lang. Cogn. Process. 2(3–4), 205–215 (1987).
Dhami, M. K. & Mandel, D. R. Communicating uncertainty using words and numbers. Trends Cogn. Sci. 26(6), 514–526 (2022).
Juanchich, M. & Sirota, M. Most family physicians report communicating the risks of adverse drug reactions in words (vs. numbers). Appl. Cogn. Psychol. 34, 526–534 (2020).
Wallsten, T. S., Budescu, D. V., Zwick, R. & Kemp, S. M. Preferences and reasons for communicating probabilistic information in verbal or numerical terms. Bull. Psychon. Soc. 31(2), 135–138 (1993).
Erev, I. & Cohen, B. L. Verbal versus numerical probabilities: Efficiency, biases, and the preference paradox. Organ. Behav. Hum. Decis. Process. 45(1), 1–18 (1990).
Honda, H. et al. Decisions based on verbal probabilities: Decision bias or decision by belief sampling? In Proceedings of the 39th Annual Conference of the Cognitive Science Society (Gunzelmann, G. et al., Eds.), 557–562, Cognitive Science Society (2017).
Bonnefon, J.-F. & Villejoubert, G. Tactful or doubtful? Expectations of politeness explain the severity bias in the interpretation of probability phrases. Psychol. Sci. 17, 747–751 (2006).
Juanchich, M. et al. The perceived functions of linguistic risk quantifiers and their effect on risk, negativity perception and decision making. Organ. Behav. Hum. Decis. Process. 118, 72–81 (2012).
Jenkins, S. C. & Harris, A. J. L. Maintaining credibility when communicating uncertainty: The role of directionality. Think. Reason. 27, 97–123 (2020).
Acknowledgements
L.T.M.: Partial support from the Institut d’études avancées de Paris, Paris, France. We thank Garrison Cottrell, Catherine Hanson, Stephen Hanson, Boris Power, and Edwin Williams for comments on earlier drafts.
Author information
Contributions
L.T.M., M.D.M., V.F. and V.M. designed the experiment and analyses. V.F. and V.M. implemented and ran the design as a computer-controlled experiment with human participants and M.D.M. implemented and ran the GPT-4 version. L.T.M., M.D.M., V.F. and V.M. analyzed data and prepared plots and tables. L.T.M., M.D.M., V.F., V.M. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Maloney, L.T., Dal Martello, M.F., Fei, V. et al. A comparison of human and GPT-4 use of probabilistic phrases in a coordination game. Sci Rep 14, 6835 (2024). https://doi.org/10.1038/s41598-024-56740-9