

Testing the limits of natural language models for predicting human language judgements

A preprint version of the article is available at arXiv.

Abstract

Neural network language models appear to be increasingly aligned with how humans process and generate language, but identifying their weaknesses through adversarial examples is challenging due to the discrete nature of language and the complexity of human language perception. We bypass these limitations by turning the models against each other. We generate controversial sentence pairs, for which two language models disagree about which sentence is more likely to occur. Considering nine language models (including n-gram models, recurrent neural networks and transformers), we created hundreds of controversial sentence pairs either by synthetic optimization or by selecting sentences from a corpus. Controversial sentence pairs proved highly effective at revealing model failures and at identifying the models that aligned most closely with human judgements of which sentence is more likely. The most human-consistent model tested was GPT-2, although experiments also revealed substantial shortcomings in its alignment with human perception.
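To make the corpus-selection idea concrete, here is a minimal sketch (not the authors' exact pipeline) of mining controversial pairs: a pair is controversial when two models disagree about which sentence is more likely. The two scoring functions are hypothetical placeholders for any pair of sentence-probability models.

```python
# A minimal sketch of mining controversial sentence pairs from a corpus.
# The scoring callables are hypothetical placeholders; in the paper each
# model (n-gram, RNN, transformer) supplies its own log-probability estimate.
from itertools import combinations
from typing import Callable, Iterable, Iterator, Tuple

def controversial_pairs(
    sentences: Iterable[str],
    logp_a: Callable[[str], float],  # e.g. GPT-2 summed token log-probability
    logp_b: Callable[[str], float],  # e.g. a 2-gram model's log-probability
) -> Iterator[Tuple[str, str]]:
    """Yield pairs on which the two models disagree about which sentence is
    more likely; each yielded pair is ordered so that model A prefers the
    first sentence and model B prefers the second."""
    for s1, s2 in combinations(sentences, 2):
        a_prefers_s1 = logp_a(s1) > logp_a(s2)
        b_prefers_s1 = logp_b(s1) > logp_b(s2)
        if a_prefers_s1 and not b_prefers_s1:
            yield s1, s2
        elif b_prefers_s1 and not a_prefers_s1:
            yield s2, s1
```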


Fig. 1: Model comparison using natural sentences.
Fig. 2: Synthesizing controversial sentence pairs.
Fig. 3: Model comparison using synthetic sentences.
Fig. 4: Ordinal correlation of the models’ sentence probability log ratios and human Likert ratings.


Data availability

The experimental stimuli, detailed behavioural testing results and code for reproducing all analyses and figures are available at github.com/dpmlab/contstimlang (ref. 67).

Code availability

Sentence optimization code is available at github.com/dpmlab/contstimlang (ref. 67).

References

  1. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).

  2. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  3. Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019); https://doi.org/10.18653/v1/n19-1423

  4. Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019).

  5. Conneau, A. & Lample, G. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) Vol. 32 (Curran Associates, 2019); https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf

  6. Clark, K., Luong, M., Le, Q. V. & Manning, C. D. ELECTRA: pre-training text encoders as discriminators rather than generators. In Proc. 8th International Conference on Learning Representations ICLR 2020 (ICLR, 2020); https://openreview.net/forum?id=r1xMH1BtvB

  7. Radford, A. et al. Language Models are Unsupervised Multitask Learners (OpenAI, 2019); https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  8. Goodkind, A. & Bicknell, K. Predictive power of word surprisal for reading times is a linear function of language model quality. In Proc. 8th Workshop on Cognitive Modeling and Computational Linguistics, CMCL 2018 10–18 (Association for Computational Linguistics, 2018); https://doi.org/10.18653/v1/W18-0102

  9. Shain, C., Blank, I. A., van Schijndel, M., Schuler, W. & Fedorenko, E. fMRI reveals language-specific predictive coding during naturalistic sentence comprehension. Neuropsychologia 138, 107307 (2020).

  10. Broderick, M. P., Anderson, A. J., Di Liberto, G. M., Crosse, M. J. & Lalor, E. C. Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech. Curr. Biol. 28, 803–809 (2018).

  11. Goldstein, A. et al. Shared computational principles for language processing in humans and deep language models. Nat. Neurosci. 25, 369–380 (2022).

  12. Lau, J. H., Clark, A. & Lappin, S. Grammaticality, acceptability and probability: a probabilistic view of linguistic knowledge. Cogn. Sci. 41, 1202–1241 (2017).

  13. Lau, J. H., Armendariz, C., Lappin, S., Purver, M. & Shu, C. How furiously can colorless green ideas sleep? Sentence acceptability in context. Trans. Assoc. Comput. Ling. 8, 296–310 (2020).

  14. Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proc. 7th International Conference on Learning Representations, ICLR 2019 (ICLR, 2019); https://openreview.net/forum?id=rJ4km2R5t7

  15. Wang, A. et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 3266–3280 (Curran Associates, 2019); https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf

  16. Warstadt, A. et al. BLiMP: the benchmark of linguistic minimal pairs for English. Trans. Assoc. Comput. Ling. 8, 377–392 (2020).

  17. Kiela, D. et al. Dynabench: rethinking benchmarking in NLP. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4110–4124 (Association for Computational Linguistics, 2021); https://doi.org/10.18653/v1/2021.naacl-main.324

  18. Box, G. E. P. & Hill, W. J. Discrimination among mechanistic models. Technometrics 9, 57–71 (1967).

  19. Golan, T., Raju, P. C. & Kriegeskorte, N. Controversial stimuli: pitting neural networks against each other as models of human cognition. Proc. Natl Acad. Sci. USA 117, 29330–29337 (2020).

  20. Cross, D. V. Sequential dependencies and regression in psychophysical judgments. Percept. Psychophys. 14, 547–552 (1973).

  21. Foley, H. J., Cross, D. V. & O'Reilly, J. A. Pervasiveness and magnitude of context effects: evidence for the relativity of absolute magnitude estimation. Percept. Psychophys. 48, 551–558 (1990).

  22. Petzschner, F. H., Glasauer, S. & Stephan, K. E. A Bayesian perspective on magnitude estimation. Trends Cogn. Sci. 19, 285–293 (2015).

  23. Greenbaum, S. Contextual influence on acceptability judgments. Linguistics 15, 5–12 (1977).

  24. Schütze, C. T. & Sprouse, J. in Research Methods in Linguistics (eds Podesva, R. J. & Sharma, D.) 27–50 (Cambridge Univ. Press, 2014); https://doi.org/10.1017/CBO9781139013734.004

  25. Sprouse, J. & Almeida, D. Design sensitivity and statistical power in acceptability judgment experiments. Glossa 2, 14 (2017).

  26. Lindsay, G. W. Convolutional neural networks as a model of the visual system: past, present and future. J. Cogn. Neurosci. 33, 2017–2031 (2021).

  27. Wehbe, L., Vaswani, A., Knight, K. & Mitchell, T. Aligning context-based statistical models of language with brain activity during reading. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 233–243 (Association for Computational Linguistics, 2014); https://doi.org/10.3115/v1/D14-1030

  28. Toneva, M. & Wehbe, L. Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) Vol. 32 (Curran Associates, 2019); https://proceedings.neurips.cc/paper/2019/file/749a8e6c231831ef7756db230b4359c8-Paper.pdf

  29. Heilbron, M., Armeni, K., Schoffelen, J.-M., Hagoort, P. & De Lange, F. P. A hierarchy of linguistic predictions during natural language comprehension. Proc. Natl Acad. Sci. USA 119, e2201968119 (2022).

  30. Jain, S. et al. Interpretable multi-timescale models for predicting fMRI responses to continuous natural speech. In Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) Vol. 33, 13738–13749 (Curran Associates, 2020); https://proceedings.neurips.cc/paper_files/paper/2020/file/9e9a30b74c49d07d8150c8c83b1ccf07-Paper.pdf

  31. Lyu, B., Marslen-Wilson, W. D., Fang, Y. & Tyler, L. K. Finding structure in time: humans, machines and language. Preprint at https://www.biorxiv.org/content/10.1101/2021.10.25.465687v2 (2021).

  32. Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA 118, e2105646118 (2021).

  33. Wilcox, E., Vani, P. & Levy, R. A targeted assessment of incremental processing in neural language models and humans. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) 939–952 (Association for Computational Linguistics, 2021); https://doi.org/10.18653/v1/2021.acl-long.76

  34. Caucheteux, C. & King, J.-R. Brains and algorithms partially converge in natural language processing. Commun. Biol. 5, 134 (2022).

  35. Arehalli, S., Dillon, B. & Linzen, T. Syntactic surprisal from neural models predicts, but underestimates, human processing difficulty from syntactic ambiguities. In Proc. 26th Conference on Computational Natural Language Learning (CoNLL) 301–313 (Association for Computational Linguistics, 2022); https://aclanthology.org/2022.conll-1.20

  36. Merkx, D. & Frank, S. L. Human sentence processing: recurrence or attention? In Proc. Workshop on Cognitive Modeling and Computational Linguistics 12–22 (Association for Computational Linguistics, 2021); https://doi.org/10.18653/v1/2021.cmcl-1.2

  37. Michaelov, J. A., Bardolph, M. D., Coulson, S. & Bergen, B. K. Different kinds of cognitive plausibility: why are transformers better than RNNs at predicting N400 amplitude? In Proc. Annual Meeting of the Cognitive Science Society Vol. 43 (2021); https://escholarship.org/uc/item/9z06m20f

  38. Rakocevic, L. I. Synthesizing controversial sentences for testing the brain-predictivity of language models. PhD thesis, Massachusetts Institute of Technology (2021); https://hdl.handle.net/1721.1/130713

  39. Goodman, N. D. & Frank, M. C. Pragmatic language interpretation as probabilistic inference. Trends Cogn. Sci. 20, 818–829 (2016).

  40. Howell, S. R., Jankowicz, D. & Becker, S. A model of grounded language acquisition: sensorimotor features improve lexical and grammatical learning. J. Mem. Lang. 53, 258–276 (2005).

  41. Szegedy, C. et al. Intriguing properties of neural networks. Preprint at http://arxiv.org/abs/1312.6199 (2013).

  42. Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In Proc. 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings (2015); http://arxiv.org/abs/1412.6572

  43. Zhang, W. E., Sheng, Q. Z., Alhazmi, A. & Li, C. Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Trans. Intell. Syst. Technol. 11, 1–41 (2020).

  44. Liang, B. et al. Deep text classification can be fooled. In Proc. Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 4208–4215 (International Joint Conferences on Artificial Intelligence Organization, 2018); https://doi.org/10.24963/ijcai.2018/585

  45. Ebrahimi, J., Rao, A., Lowd, D. & Dou, D. HotFlip: white-box adversarial examples for text classification. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 31–36 (Association for Computational Linguistics, 2018); https://doi.org/10.18653/v1/P18-2006

  46. Abdou, M. et al. The sensitivity of language models and humans to Winograd schema perturbations. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 7590–7604 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.acl-main.679

  47. Alzantot, M. et al. Generating natural language adversarial examples. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 2890–2896 (Association for Computational Linguistics, 2018); https://doi.org/10.18653/v1/D18-1316

  48. Ribeiro, M. T., Singh, S. & Guestrin, C. Semantically equivalent adversarial rules for debugging NLP models. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 856–865 (Association for Computational Linguistics, 2018); https://doi.org/10.18653/v1/P18-1079

  49. Ren, S., Deng, Y., He, K. & Che, W. Generating natural language adversarial examples through probability weighted word saliency. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 1085–1097 (Association for Computational Linguistics, 2019); https://doi.org/10.18653/v1/P19-1103

  50. Morris, J., Lifland, E., Lanchantin, J., Ji, Y. & Qi, Y. Reevaluating adversarial examples in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020 3829–3839 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.findings-emnlp.341

  51. Wallace, E., Rodriguez, P., Feng, S., Yamada, I. & Boyd-Graber, J. Trick me if you can: human-in-the-loop generation of adversarial examples for question answering. Trans. Assoc. Comput. Ling. 7, 387–401 (2019).

  52. Perez, E. et al. Red teaming language models with language models. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing 3419–3448 (Association for Computational Linguistics, 2022); https://doi.org/10.18653/v1/2022.emnlp-main.225

  53. Gibson, E. Linguistic complexity: locality of syntactic dependencies. Cognition 68, 1–76 (1998).

  54. Watt, W. C. The indiscreteness with which impenetrables are penetrated. Lingua 37, 95–128 (1975).

  55. Schütze, C. T. The Empirical Base of Linguistics, Classics in Linguistics Vol. 2 (Language Science Press, 2016); https://doi.org/10.17169/langsci.b89.100

  56. Bird, S., Klein, E. & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O'Reilly Media, 2009).

  57. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) Vol. 32, 8024–8035 (Curran Associates, 2019); http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

  58. Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 38–45 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.emnlp-demos.6

  59. Yamakoshi, T., Griffiths, T. & Hawkins, R. Probing BERT’s priors with serial reproduction chains. In Findings of the Association for Computational Linguistics, ACL 2022 3977–3992 (Association for Computational Linguistics, 2022); https://doi.org/10.18653/v1/2022.findings-acl.314

  60. Chestnut, S. Perplexity https://drive.google.com/uc?export=download&id=1gSNfGQ6LPxlNctMVwUKrQpUA7OLZ83PW (accessed 23 September 2022).

  61. van Heuven, W. J. B., Mandera, P., Keuleers, E. & Brysbaert, M. SUBTLEX-UK: a new and improved word frequency database for British English. Q. J. Exp. Psychol. 67, 1176–1190 (2014).

  62. Wang, Z. & Simoncelli, E. P. Maximum differentiation (MAD) competition: a methodology for comparing computational models of perceptual quantities. J. Vis. 8, 8 (2008).

  63. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B (Methodol.) 57, 289–300 (1995).

  64. Wang, A. & Cho, K. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proc. Workshop on Methods for Optimizing and Evaluating Neural Language Generation 30–36 (Association for Computational Linguistics, 2019); https://doi.org/10.18653/v1/W19-2304

  65. Cho, K. BERT has a mouth and must speak, but it is not an MRF https://kyunghyuncho.me/bert-has-a-mouth-and-must-speak-but-it-is-not-an-mrf/ (accessed 28 September 2022).

  66. Salazar, J., Liang, D., Nguyen, T. Q. & Kirchhoff, K. Masked language model scoring. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 2699–2712 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.acl-main.240

  67. Golan, T., Siegelman, M., Kriegeskorte, N. & Baldassano, C. Code and data for ‘Testing the limits of natural language models for predicting human language judgments’ (Zenodo, 2023); https://doi.org/10.5281/zenodo.8147166


Acknowledgements

This material is based on work partially supported by the National Science Foundation under grant no. 1948004 to N.K. This publication was made possible with the support of the Charles H. Revson Foundation (to T.G.). The statements made and views expressed, however, are solely the responsibility of the authors.

Author information

Contributions

T.G., M.S., N.K. and C.B. designed the study. M.S. implemented the computational models and T.G. implemented the sentence pair optimization procedures. M.S. conducted the behavioural experiments. T.G. and M.S. analysed the experiments’ results. T.G., M.S., N.K. and C.B. wrote the paper.

Corresponding author

Correspondence to Tal Golan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Jacob Huth, in collaboration with the Nature Machine Intelligence team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 An example of one experimental trial, as presented to the participants.

The participant must choose one sentence while providing their confidence rating on a 3-point scale.

Extended Data Fig. 2 Between-model agreement rate on the probability ranking of the 90 randomly sampled and paired natural sentence pairs evaluated in the experiment.

Each cell represents the proportion of sentence pairs for which the two models produce congruent probability rankings (that is, both models assign a higher probability to sentence 1, or both assign a higher probability to sentence 2).
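As a concrete illustration of how one cell of such an agreement matrix can be computed, here is a minimal NumPy sketch; the (n_pairs, 2) array layout for each model's log-probabilities is an assumption for illustration, not the authors' data format.

```python
import numpy as np

def agreement_rate(logp_m1: np.ndarray, logp_m2: np.ndarray) -> float:
    """Proportion of sentence pairs on which two models rank the sentences
    congruently. Each array has (assumed) shape (n_pairs, 2), holding the
    log-probabilities a model assigns to sentence 1 and sentence 2 of each pair.
    """
    pref_m1 = logp_m1[:, 0] > logp_m1[:, 1]  # does model 1 prefer sentence 1?
    pref_m2 = logp_m2[:, 0] > logp_m2[:, 1]  # does model 2 prefer sentence 1?
    return float(np.mean(pref_m1 == pref_m2))
```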

Extended Data Fig. 3 Pairwise model comparison of model-human consistency.

For each pair of models (represented as one cell in the matrices above), the only trials considered were those in which the stimuli were either selected (a) or synthesized (b) to contrast the predictions of the two models. For these trials, the two models always made controversial predictions (that is, one sentence is preferred by the first model and the other sentence is preferred by the second model). The matrices above depict the proportion of trials in which the binarized human judgments aligned with the row model (‘model 1’). For example, GPT-2 (top row) was always more aligned (green hues) with the human choices than its rival models. In contrast, the 2-gram model (bottom row) was always less aligned (purple hues) with the human choices than its rival models.

Extended Data Fig. 4 Pairwise model analysis of human response for natural vs. synthetic sentence pairs.

In each optimization condition, a synthetic sentence s was formed by modifying a natural sentence n so that the synthetic sentence would be ‘rejected’ by one model (m_reject, columns), minimizing p(s | m_reject), and would be ‘accepted’ by another model (m_accept, rows), satisfying the constraint p(s | m_accept) ≥ p(n | m_accept). Each cell above summarizes model–human agreement in trials resulting from one such optimization condition. The color of each cell denotes the proportion of trials in which humans judged a synthetic sentence to be more likely than its natural counterpart and hence aligned with m_accept. For example, the top-right cell depicts human judgments for sentence pairs formed to minimize the probability assigned to the synthetic sentence by the simple 2-gram model while ensuring that GPT-2 would judge the synthetic sentence to be at least as likely as the natural sentence; humans favored the synthetic sentence in only 22 out of the 100 sentence pairs in this condition.
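The constrained search described in this caption can be sketched schematically as follows. This greedy single-word-substitution loop is a simplified illustration of the objective and constraint, not the authors' exact optimization procedure; `vocabulary`, `logp_accept` and `logp_reject` are hypothetical placeholders.

```python
# Schematic sketch: drive down p(s | m_reject) by single-word substitutions
# while never letting p(s | m_accept) fall below p(n | m_accept).
def synthesize_sentence(natural, vocabulary, logp_accept, logp_reject,
                        max_steps=50):
    words = natural.split()
    floor = logp_accept(" ".join(words))    # constraint floor: p(n | m_accept)
    current = logp_reject(" ".join(words))  # objective: p(s | m_reject)
    for _ in range(max_steps):
        best_score, best_words = current, None
        for i in range(len(words)):
            for w in vocabulary:
                candidate = words[:i] + [w] + words[i + 1:]
                s = " ".join(candidate)
                if logp_accept(s) < floor:  # m_accept must keep accepting s
                    continue
                score = logp_reject(s)
                if score < best_score:      # m_reject should dislike s more
                    best_score, best_words = score, candidate
        if best_words is None:              # no admissible improvement left
            break
        words, current = best_words, best_score
    return " ".join(words)
```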

Extended Data Fig. 5 Human consistency of bidirectional transformers: approximate log-likelihood versus pseudo-log-likelihood (PLL).

Each dot in the plots above depicts the ordinal correlation between the judgments of one participant and the predictions of one model. (a) The performance of BERT, RoBERTa and ELECTRA in predicting the human judgments of randomly sampled natural sentence pairs in the main experiment, using two different likelihood measures: our novel approximate log-likelihood method (that is, averaging multiple conditional probability chains; see Methods) and pseudo-log-likelihood (PLL; summing the log-probability of each word given all of the other words; ref. 64). For each model, we statistically compared the two likelihood measures to each other and to the noise ceiling using a two-sided Wilcoxon signed-rank test across participants. The false discovery rate was controlled at q < 0.05 for the nine comparisons. When predicting human preferences for natural sentences, the pseudo-log-likelihood measure is at least as accurate as our proposed approximate log-likelihood measure. (b) Results from a follow-up experiment in which we synthesized controversial sentence pairs for each of the model pairs, pitting the two alternative likelihood measures against each other. Statistical testing was conducted in the same fashion as in panel a. These results indicate that for each of the three bidirectional language models, the approximate log-likelihood measure is considerably and significantly (q < 0.05) more human-consistent than the pseudo-log-likelihood measure. Synthetic controversial sentence pairs uncover a dramatic failure mode of the pseudo-log-likelihood measure, which remains covert when the evaluation is limited to randomly sampled natural sentences. See Extended Data Table 2 for examples of such synthetic sentence pairs.
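For reference, the PLL measure discussed here (in the spirit of refs. 64 and 66) can be sketched with standard Hugging Face Transformers calls. This is an illustrative sketch assuming the stock bert-base-uncased checkpoint, not the authors' evaluation code.

```python
# Pseudo-log-likelihood sketch: mask each token in turn and sum the
# log-probability the masked language model assigns to the true token.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    total = 0.0
    # skip the special [CLS] (first) and [SEP] (last) positions
    for i in range(1, ids.shape[1] - 1):
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[ids[0, i]].item()
    return total
```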

Extended Data Fig. 6 Model prediction accuracy for pairs of natural and synthetic sentences, evaluating each model across all of the sentence pairs in which it was targeted to rate the synthetic sentence to be less probable than the natural sentence.

The data binning applied here is complementary to the one used in Fig. 3b, where each model was evaluated across all of the sentence pairs in which it was targeted to rate the synthetic sentence to be at least as probable as the natural sentence. Unlike Fig. 3b, where all of the models performed poorly, here no models were found to be significantly below the lower bound on the noise ceiling; typically, when a sentence was optimized to decrease its probability under any model (despite the sentence probability not decreasing under a second model), humans agreed that the sentence became less probable.

Extended Data Table 1 Examples of pairs of synthetic and natural sentences that maximally contributed to each model’s prediction error
Extended Data Table 2 Examples of controversial synthetic-sentence pairs that maximally contributed to the prediction error of bidirectional transformers using pseudo-log-likelihood (PLL)

Supplementary information

Supplementary Information

Supplementary methods 1.1–1.3, results 2.1–2.3, Figs. 1–3 and Table 1.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Golan, T., Siegelman, M., Kriegeskorte, N. et al. Testing the limits of natural language models for predicting human language judgements. Nat Mach Intell 5, 952–964 (2023). https://doi.org/10.1038/s42256-023-00718-1

