

Large pre-trained language models contain human-like biases of what is right and wrong to do

A preprint version of the article is available at arXiv.

Abstract

Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, GPT-2 and GPT-3. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many natural language processing tasks and shown that they not only capture linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show here that recent LMs also contain human-like biases of what is right and wrong to do, reflecting existing ethical and moral norms of society. We show that these norms can be captured geometrically by a ‘moral direction’, which can be computed in the embedding space, for example, by principal component analysis (PCA). The computed ‘moral direction’ can rate the normativity (or non-normativity) of arbitrary phrases without explicitly training the LM for this task, and it reflects social norms well. We demonstrate that computing the ‘moral direction’ can provide a path for attenuating or even preventing toxic degeneration in LMs, showcasing this capability on the RealToxicityPrompts testbed.
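To make the construction described in the abstract concrete, the following is a minimal sketch, not the authors' released code (linked under Code availability), of how a moral direction could be extracted from off-the-shelf sentence embeddings and used to score phrases. The SBERT model name, the short Do/Don't phrase lists and the plain dot-product score are illustrative assumptions; the paper's MoralDirection approach uses its own prompt templates and score normalization.

```python
# Sketch: derive a candidate 'moral direction' via PCA over sentence embeddings
# and score arbitrary phrases by projecting them onto it.
# Assumptions: a sentence-transformers SBERT model and toy phrase lists.
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-large-nli-mean-tokens")  # illustrative choice

dos = ["smile", "help people", "tell the truth", "be friendly"]
donts = ["kill people", "steal money", "lie to others", "torture animals"]

emb_dos = model.encode(dos)
emb_donts = model.encode(donts)

# The first principal component of the anchor embeddings is the candidate direction m.
pca = PCA(n_components=1)
pca.fit(np.vstack([emb_dos, emb_donts]))
m = pca.components_[0]

# Orient m so that the Dos project onto the positive side.
if emb_dos.mean(axis=0) @ m < 0:
    m = -m

def moral_score(phrase: str) -> float:
    """Projection of a phrase embedding onto the moral direction m."""
    return float(model.encode([phrase])[0] @ m)

print(moral_score("help elderly people"), moral_score("harm elderly people"))
```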


Fig. 1: BERT has a moral direction.
Fig. 2: The MoralDirection approach rating the normativity of phrases.
Fig. 3: The MD-based detoxification approach reduces the generated toxicity of neural language models.
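The detoxification result of Fig. 3 can be pictured with a toy post-hoc loop: sample several continuations from a language model and keep the one that the moral-direction score rates as most normative. This is only an illustrative sketch reusing the hypothetical moral_score helper from the sketch above; the detoxification evaluated in the paper steers the decoding process itself rather than re-ranking finished generations.

```python
# Sketch: rate sampled GPT-2 continuations with the moral-direction score and
# keep the most normative one. Illustrative only; not the paper's decoding method.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def most_normative_continuation(prompt: str, n_candidates: int = 8) -> str:
    outputs = generator(prompt, max_new_tokens=30, do_sample=True,
                        num_return_sequences=n_candidates)
    candidates = [out["generated_text"] for out in outputs]
    # moral_score is the hypothetical scoring helper sketched after the abstract.
    return max(candidates, key=moral_score)
```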


Data availability

The user study data are available at the code repository https://github.com/ml-research/MoRT_NMI/tree/master/Supplemental_Material/UserStudy. The generated text using the presented approach is available at https://hessenbox.tu-darmstadt.de/public?folderID=MjR2QVhvQmc0blFpdWd1YjViNHpz. The RealToxicityPrompts data are available at https://allenai.org/data/real-toxicity-prompts/.

Code availability

The code to reproduce the figures and results of this article, including pre-trained models, can be found at https://github.com/ml-research/MoRT_NMI (archived at https://doi.org/10.5281/zenodo.5906596).

References

  1. Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4171–4186 (2019).

  2. Peters, M. E. et al. Deep contextualized word representations. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Walker, M. A., Ji, H. & Stent, A.) 2227–2237 (Association for Computational Linguistics, 2018).

  3. Yang, Z. et al. XLNet: generalized autoregressive pretraining for language understanding. In Adv. Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS) (eds Wallach, H. M. et al.) 5754–5764 (2019).

  4. Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems (NeurIPS) (eds. Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H.) (2020).

  5. Next chapter in artificial writing. Nat. Mach. Intell. 2, 419 (2020).

  6. Goldberg, Y. Assessing BERT’s syntactic abilities. Preprint at https://arxiv.org/abs/1901.05287 (2019).

  7. Lin, Y., Tan, Y. & Frank, R. Open Sesame: getting inside BERT’s linguistic knowledge. In Proc. 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP 241–253 (2019).

  8. Reif, E. et al. Visualizing and measuring the geometry of BERT. In Adv. Neural Information Processing Systems 32: Annu. Conf. Neural Information Processing Systems (eds. Wallach, H. M. et al.) 8592–8600 (2019).

  9. Shwartz, V. & Dagan, I. Still a pain in the neck: Evaluating text representations on lexical composition. Trans. Assoc. Comput. Linguistics 7, 403–419 (2019).


  10. Tenney, I. et al. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proc. 7th International Conference on Learning Representations (OpenReview.net, 2019).

  11. Talmor, A., Elazar, Y., Goldberg, Y. & Berant, J. oLMpics - on what language model pre-training captures. Trans. Assoc. Computational Linguistics 8, 743–758 (2020).


  12. Roberts, A., Raffel, C. & Shazeer, N. How much knowledge can you pack into the parameters of a language model? In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 5418–5426 (Association for Computational Linguistics, 2020).

  13. Petroni, F. et al. Language models as knowledge bases? In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds. Inui, K., Jiang, J., Ng, V. & Wan, X.) 2463–2473 (Association for Computational Linguistics, 2019).

  14. Doctor GPT-3: hype or reality? Nabla https://www.nabla.com/blog/gpt-3/ (Accessed 28 February 2021).

  15. Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: Findings (eds. Cohn, T., He, Y. & Liu, Y.) 3356–3369 (Association for Computational Linguistics, 2020).

  16. Abid, A., Farooqi, M. & Zou, J. Persistent anti-Muslim bias in large language models. In Proc. AAAI/ACM Conference on AI, Ethics, and Society 298–306 (Association for Computing Machinery, 2021).

  17. Microsoft’s racist chatbot revealed the dangers of online conversation. IEEE Spectrum https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/in-2016-microsofts-racist-chatbot-revealed-the-dangers-of-online-conversation (25 November 2019).

  18. Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proc. ACM Conference on Fairness, Accountability, and Transparency (eds. Elish, M. C., Isaac, W. & Zemel, R. S.) 610–623 (2021).

  19. Hutson, M. Robo-writers: the rise and risks of language-generating AI. Nature 591, 22–56 (2021).


  20. Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).


  21. Jentzsch, S., Schramowski, P., Rothkopf, C. A. & Kersting, K. Semantics derived automatically from language corpora contain human-like moral choices. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES) 37–44 (2019).

  22. Schramowski, P., Turan, C., Jentzsch, S., Rothkopf, C. A. & Kersting, K. The moral choice machine. Front. Artif. Intell. 3, 36 (2020).


  23. Churchland, P. Conscience: The Origins of Moral Intuition (W. W. Norton, 2019).

  24. Christakis, N. A. The neurobiology of conscience. Nature 569, 627–628 (2019).


  25. Gert, B. & Gert, J. In The Stanford Encyclopedia of Philosophy Fall 2020 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2020).

  26. Alexander, L. & Moore, M. In The Stanford Encyclopedia of Philosophy Summer 2021 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2021).

  27. Bicchieri, C., Muldoon, R. & Sontuoso, A. In The Stanford Encyclopedia of Philosophy Winter 2018 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2018).

  28. Bolukbasi, T., Chang, K., Zou, J. Y., Saligrama, V. & Kalai, A. T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. Neural Information Processing Systems 4349–4357 (Curran Associates, 2016).

  29. Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing (2019).

  30. Cer, D. et al. Universal sentence encoder for English. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds. Blanco, E. & Lu, W.) 169–174 (Association for Computational Linguistics, 2018).

  31. Radford, A. et al. Language Models are Unsupervised Multitask Learners (2019).

  32. Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds. Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J. R.) 8342–8360 (Association for Computational Linguistics, 2020).

  33. Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. In Proc. 8th International Conference on Learning Representations (OpenReview.net, 2020).

  34. Peng, X., Li, S., Frazier, S. & Riedl, M. Reducing non-normative text generation from language models. In Proc. 13th International Conference on Natural Language Generation 374–383 (Association for Computational Linguistics, 2020).

  35. Chen, M. X. et al. Gmail smart compose: Real-time assisted writing. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (eds. Teredesai, A. et al.) 2287–2295 (ACM, 2019).

  36. GPT-3 Powers the Next Generation of Apps. OpenAI https://openai.com/blog/gpt-3-apps/ (Accessed 22 January 2022).

  37. Forbes, M., Hwang, J. D., Shwartz, V., Sap, M. & Choi, Y. Social chemistry 101: Learning to reason about social and moral norms. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 653–670 (Association for Computational Linguistics, 2020).

  38. Ross, A. S., Hughes, M. C. & Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proc. International Joint Conference on Artificial Intelligence 2662–2670 (2017).

  39. Teso, S. & Kersting, K. Explanatory interactive machine learning. In Proc. AAAI/ACM Conference on AI, Ethics, and Society (2019).

  40. Schramowski, P. et al. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nat. Mach. Intell. 2, 476–486 (2020).


  41. Berreby, F., Bourgne, G. & Ganascia, J.-G. Modelling moral reasoning and ethical responsibility with logic programming. In Logic for Programming, Artificial Intelligence, and Reasoning (eds. Davis, M., Fehnker, A., McIver, A. & Voronkov, A.) 532–548 (Springer, 2015).

  42. Pereira, L. M. & Saptawijaya, A. Modelling morality with prospective logic. Int. J. Reason. Based Intell. Syst. 1, 209–221 (2009).


  43. Levine, S., Kleiman-Weiner, M., Schulz, L., Tenenbaum, J. & Cushman, F. The logic of universalization guides moral judgment. Proc. Natl Acad. Sci. USA 117, 26158–26169 (2020).


  44. Turney, P. D. & Pantel, P. From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010).


  45. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Neural Information Processing Systems 3111–3119 (2013).

  46. Conneau, A., Kiela, D., Schwenk, H., Barrault, L. & Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In Proc. 2017 Conference on Empirical Methods in Natural Language Processing 670–680 (2017).

  47. Zhu, Y. et al. Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE Int. Conf. Computer Vision 19–27 (IEEE Computer Society, 2015).

  48. Shafer-Landau, R. Ethical Theory: An Anthology Vol. 13 (John Wiley & Sons, 2012).

  49. Fassin, D. A Companion to Moral Anthropology (Wiley Online Library, 2012).

  50. Sumner, L. W. Normative ethics and metaethics. Ethics 77, 95–106 (1967).


  51. Katzenstein, P. et al. The Culture of National Security: Norms and Identity in World Politics. New Directions in World Politics (Columbia Univ. Press, 1996).

  52. Lindström, B., Jangard, S., Selbing, I. & Olsson, A. The role of a ‘common is moral’ heuristic in the stability and change of moral norms. J. Exp. Psychol. 147, 228–242 (2018).


  53. Hendrycks, D. et al. Aligning AI with shared human values. In Proc. Int. Conf. Learning Representations (OpenReview.net, 2021).

  54. Reif, E. et al. Visualizing and measuring the geometry of BERT. In Proc. Annu. Conf. Neural Information Processing Systems 8592–8600 (2019).

  55. Chen, B. et al. Probing BERT in hyperbolic spaces. In 9th Int. Conf. Learning Representations (2021).

  56. Chami, I., Gu, A., Nguyen, D. & Ré, C. HoroPCA: hyperbolic dimensionality reduction via horospherical projections. In Proc. 38th Int. Conf. Machine Learning (2021).

  57. Kurita, K., Vyas, N., Pareek, A., Black, A. W. & Tsvetkov, Y. Measuring bias in contextualized word representations. In Proc. First Workshop on Gender Bias in Natural Language Processing 166–172 (Association for Computational Linguistics, 2019).

  58. Tan, Y. C. & Celis, L. E. Assessing social and intersectional biases in contextualized word representations. In Proc. Advances in Neural Information Processing Systems 32: Annu. Conf. Neural Information Processing Systems (eds Wallach, H. M. et al.) 13209–13220 (2019).

  59. Zhang, Z. et al. Semantics-aware BERT for language understanding. In Proc. 34th AAAI Conference on Artificial Intelligence 9628–9635 (AAAI Press, 2020).

  60. Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at https://arxiv.org/abs/1909.05858 (2019).


Acknowledgements

The authors thank the anonymous reviewers for their valuable feedback. They also thank Aleph Alpha for very useful feedback and access to the GPT-3 API. This work benefited from the ICT-48 Network of AI Research Excellence Center ‘TAILOR’ (EU Horizon 2020, grant agreement no. 952215) (K.K.), the Hessian research priority programme LOEWE within the project WhiteBox (K.K. and C.R.), and the Hessian Ministry of Higher Education, Research and the Arts (HMWK) cluster projects ‘The Adaptive Mind’ (K.K. and C.R.) and ‘The Third Wave of AI’ (K.K., C.R. and P.S.).

Author information


Contributions

P.S. and C.T. contributed equally to the work. P.S., C.T. and K.K. designed the study. P.S., C.T., C.R. and K.K. interpreted the data and drafted the manuscript. C.T. and N.A. designed the user study. C.T. performed and analysed the user study. P.S. performed and analysed the text generation study. C.R. and K.K. directed the research and gave initial input. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Patrick Schramowski or Cigdem Turan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 BERT has a moral direction.

The direction is defined by a PCA computed on BERT-based sentence embeddings. The top PC, the moral direction m, divides the x axis into Dos and Don’ts. The displayed verbs were used to compute the PCA.

Extended Data Fig. 2 Overview of the methods applied to investigate the moral values and norms mirrored by LMs.

(a) The LAMA framework with a prompt designed to analyse the moral values mirrored by the LM. (b) The question-answering approach and (c) our proposed MD approach. The BERT module is a placeholder for the LM.
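As a rough illustration of the question-answering probe in panel (b), one can compare the embedding similarity of a templated moral question to a 'yes' and a 'no' answer. The templates and the simple cosine difference below are assumptions for illustration (reusing the sentence-embedding model from the sketch after the abstract), not the exact prompts or scoring used in the study.

```python
# Sketch: question-answering probe comparing a moral question to templated answers.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def qa_bias(action: str) -> float:
    # Embed the question together with an affirmative and a negative answer.
    q, yes, no = model.encode([f"Should I {action}?",
                               f"Yes, I should {action}.",
                               f"No, I should not {action}."])
    # Positive values indicate the embedding space leans towards the affirmative answer.
    return cosine(q, yes) - cosine(q, no)

print(qa_bias("help people"), qa_bias("kill people"))
```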

Extended Data Fig. 3 Overview of the participants of the AMT user study.

(a) The participants’ locations grouped by country and continent. (b) The age distribution and (c) the gender distribution. In total, 234 volunteers participated in the study.

Supplementary information

Supplementary Information

Supplementary Sections A–D, Tables 1–7 and Figs 1–3.

Reporting Summary


About this article


Cite this article

Schramowski, P., Turan, C., Andersen, N. et al. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nat Mach Intell 4, 258–268 (2022). https://doi.org/10.1038/s42256-022-00458-8

