

Large pre-trained language models contain human-like biases of what is right and wrong to do

A preprint version of the article is available at arXiv.

Abstract

Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, GPT-2 and GPT-3. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many natural language processing tasks and shown that they not only capture linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show here that recent LMs also contain human-like biases of what is right and wrong to do, reflecting existing ethical and moral norms of society. We show that these norms can be captured geometrically by a ‘moral direction’, which can be computed in the embedding space, for example, by principal component analysis (PCA). The computed ‘moral direction’ can rate the normativity (or non-normativity) of arbitrary phrases without explicitly training the LM for this task, and it reflects social norms well. We demonstrate that computing the ‘moral direction’ can provide a path for attenuating or even preventing toxic degeneration in LMs, showcasing this capability on the RealToxicityPrompts testbed.
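To make the construction described in the abstract concrete, the following is a minimal sketch, not the authors' released code (linked under Code availability), of how a moral direction could be extracted from off-the-shelf sentence embeddings and used to score phrases. The SBERT model name, the short Do/Don't phrase lists and the plain dot-product score are illustrative assumptions; the paper's MoralDirection approach uses its own prompt templates and score normalization.

```python
# Sketch: derive a candidate 'moral direction' via PCA over sentence embeddings
# and score arbitrary phrases by projecting them onto it.
# Assumptions: a sentence-transformers SBERT model and toy phrase lists.
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-large-nli-mean-tokens")  # illustrative choice

dos = ["smile", "help people", "tell the truth", "be friendly"]
donts = ["kill people", "steal money", "lie to others", "torture animals"]

emb_dos = model.encode(dos)
emb_donts = model.encode(donts)

# The first principal component of the anchor embeddings is the candidate direction m.
pca = PCA(n_components=1)
pca.fit(np.vstack([emb_dos, emb_donts]))
m = pca.components_[0]

# Orient m so that the Dos project onto the positive side.
if emb_dos.mean(axis=0) @ m < 0:
    m = -m

def moral_score(phrase: str) -> float:
    """Projection of a phrase embedding onto the moral direction m."""
    return float(model.encode([phrase])[0] @ m)

print(moral_score("help elderly people"), moral_score("harm elderly people"))
```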


Fig. 1: BERT has a moral direction.
Fig. 2: The MoralDirection approach rating the normativity of phrases.
Fig. 3: The MD-based detoxification approach reduces the generated toxicity of neural language models.
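The detoxification result of Fig. 3 can be pictured with a toy post-hoc loop: sample several continuations from a language model and keep the one that the moral-direction score rates as most normative. This is only an illustrative sketch reusing the hypothetical moral_score helper from the sketch above; the detoxification evaluated in the paper steers the decoding process itself rather than re-ranking finished generations.

```python
# Sketch: rate sampled GPT-2 continuations with the moral-direction score and
# keep the most normative one. Illustrative only; not the paper's decoding method.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def most_normative_continuation(prompt: str, n_candidates: int = 8) -> str:
    outputs = generator(prompt, max_new_tokens=30, do_sample=True,
                        num_return_sequences=n_candidates)
    candidates = [out["generated_text"] for out in outputs]
    # moral_score is the hypothetical scoring helper sketched after the abstract.
    return max(candidates, key=moral_score)
```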


Data availability

The user study data are available at the code repository https://github.com/ml-research/MoRT_NMI/tree/master/Supplemental_Material/UserStudy. The generated text using the presented approach is available at https://hessenbox.tu-darmstadt.de/public?folderID=MjR2QVhvQmc0blFpdWd1YjViNHpz. The RealToxicityPrompts data are available at https://allenai.org/data/real-toxicity-prompts/.

Code availability

The code to reproduce the figures and results of this article, including pre-trained models, can be found at https://github.com/ml-research/MoRT_NMI (archived at https://doi.org/10.5281/zenodo.5906596).

References

  1. Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4171–4186 (2019).

  2. Peters, M. E. et al. Deep contextualized word representations. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Walker, M. A., Ji, H. & Stent, A.) 2227–2237 (Association for Computational Linguistics, 2018).

  3. Yang, Z. et al. XLNet: generalized autoregressive pretraining for language understanding. In Adv. Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS) (eds Wallach, H. M. et al.) 5754–5764 (2019).

  4. Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems (NeurIPS) (eds. Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H.) (2020).

  5. Next chapter in artificial writing. Nat. Mach. Intell. 2, 419 (2020).

  6. Goldberg, Y. Assessing BERT’s syntactic abilities. Preprint at https://arxiv.org/abs/1901.05287 (2019).

  7. Lin, Y., Tan, Y. & Frank, R. Open Sesame: getting inside BERT’s linguistic knowledge. In Proc. 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP 241–253 (2019).

  8. Reif, E. et al. Visualizing and measuring the geometry of BERT. In Adv. Neural Information Processing Systems 32: Annu. Conf. Neural Information Processing Systems (eds. Wallach, H. M. et al.) 8592–8600 (2019).

  9. Shwartz, V. & Dagan, I. Still a pain in the neck: Evaluating text representations on lexical composition. Trans. Assoc. Comput. Linguistics 7, 403–419 (2019).


  10. Tenney, I. et al. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proc. 7th International Conference on Learning Representations (OpenReview.net, 2019).

  11. Talmor, A., Elazar, Y., Goldberg, Y. & Berant, J. oLMpics - on what language model pre-training captures. Trans. Assoc. Computational Linguistics 8, 743–758 (2020).


  12. Roberts, A., Raffel, C. & Shazeer, N. How much knowledge can you pack into the parameters of a language model? In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 5418–5426 (Association for Computational Linguistics, 2020).

  13. Petroni, F. et al. Language models as knowledge bases? In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds. Inui, K., Jiang, J., Ng, V. & Wan, X.) 2463–2473 (Association for Computational Linguistics, 2019).

  14. Doctor GPT-3: hype or reality? Nabla https://www.nabla.com/blog/gpt-3/ (Accessed 28 February 2021).

  15. Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: Findings (eds. Cohn, T., He, Y. & Liu, Y.) 3356–3369 (Association for Computational Linguistics, 2020).

  16. Abid, A., Farooqi, M. & Zou, J. Persistent anti-Muslim bias in large language models. In Proc. AAAI/ACM Conference on AI, Ethics, and Society 298–306 (Association for Computing Machinery, 2021).

  17. Microsoft’s racist chatbot revealed the dangers of online conversation. IEEE Spectrum https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/in-2016-microsofts-racist-chatbot-revealed-the-dangers-of-online-conversation (25 November 2019).

  18. Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proc. ACM Conference on Fairness, Accountability, and Transparency (eds. Elish, M. C., Isaac, W. & Zemel, R. S.) 610–623 (2021).

  19. Hutson, M. Robo-writers: the rise and risks of language-generating AI. Nature 591, 22–56 (2021).


  20. Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).


  21. Jentzsch, S., Schramowski, P., Rothkopf, C. A. & Kersting, K. Semantics derived automatically from language corpora contain human-like moral choices. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES) 37–44 (2019).

  22. Schramowski, P., Turan, C., Jentzsch, S., Rothkopf, C. A. & Kersting, K. The moral choice machine. Front. Artif. Intell. 3, 36 (2020).


  23. Churchland, P. Conscience: The Origins of Moral Intuition (W. W. Norton, 2019).

  24. Christakis, N. A. The neurobiology of conscience. Nature 569, 627–628 (2019).


  25. Gert, B. & Gert, J. In The Stanford Encyclopedia of Philosophy Fall 2020 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2020).

  26. Alexander, L. & Moore, M. In The Stanford Encyclopedia of Philosophy Summer 2021 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2021).

  27. Bicchieri, C., Muldoon, R. & Sontuoso, A. In The Stanford Encyclopedia of Philosophy Winter 2018 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2018).

  28. Bolukbasi, T., Chang, K., Zou, J. Y., Saligrama, V. & Kalai, A. T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. Neural Information Processing Systems 4349–4357 (Curran Associates, 2016).

  29. Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing (2019).

  30. Cer, D. et al. Universal sentence encoder for English. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds. Blanco, E. & Lu, W.) 169–174 (Association for Computational Linguistics, 2018).

  31. Radford, A. et al. Language Models are Unsupervised Multitask Learners (2019).

  32. Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds. Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J. R.) 8342–8360 (Association for Computational Linguistics, 2020).

  33. Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. In Proc. 8th International Conference on Learning Representations (OpenReview.net, 2020).

  34. Peng, X., Li, S., Frazier, S. & Riedl, M. Reducing non-normative text generation from language models. In Proc. 13th International Conference on Natural Language Generation 374–383 (Association for Computational Linguistics, 2020).

  35. Chen, M. X. et al. Gmail smart compose: Real-time assisted writing. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (eds. Teredesai, A. et al.) 2287–2295 (ACM, 2019).

  36. GPT-3 Powers the Next Generation of Apps. OpenAI https://openai.com/blog/gpt-3-apps/ (Accessed 22 January 2022).

  37. Forbes, M., Hwang, J. D., Shwartz, V., Sap, M. & Choi, Y. Social chemistry 101: Learning to reason about social and moral norms. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 653–670 (Association for Computational Linguistics, 2020).

  38. Ross, A. S., Hughes, M. C. & Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proc. International Joint Conference on Artificial Intelligence 2662–2670 (2017).

  39. Teso, S. & Kersting, K. Explanatory interactive machine learning. In Proc. AAAI/ACM Conference on AI, Ethics, and Society (2019).

  40. Schramowski, P. et al. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nat. Mach. Intell. 2, 476–486 (2020).


  41. Berreby, F., Bourgne, G. & Ganascia, J.-G. Modelling moral reasoning and ethical responsibility with logic programming. In Logic for Programming, Artificial Intelligence, and Reasoning (eds. Davis, M., Fehnker, A., McIver, A. & Voronkov, A.) 532–548 (Springer, 2015).

  42. Pereira, L. M. & Saptawijaya, A. Modelling morality with prospective logic. Int. J. Reason. Based Intell. Syst. 1, 209–221 (2009).


  43. Levine, S., Kleiman-Weiner, M., Schulz, L., Tenenbaum, J. & Cushman, F. The logic of universalization guides moral judgment. Proc. Natl Acad. Sci. USA 117, 26158–26169 (2020).


  44. Turney, P. D. & Pantel, P. From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010).


  45. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Neural Information Processing Systems 3111–3119 (2013).

  46. Conneau, A., Kiela, D., Schwenk, H., Barrault, L. & Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In Proc. 2017 Conference on Empirical Methods in Natural Language Processing 670–680 (2017).

  47. Zhu, Y. et al. Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE Int. Conf. Computer Vision 19–27 (IEEE Computer Society, 2015).

  48. Shafer-Landau, R. Ethical Theory: An Anthology Vol. 13 (John Wiley & Sons, 2012).

  49. Fassin, D. A Companion to Moral Anthropology (Wiley Online Library, 2012).

  50. Sumner, L. W. Normative ethics and metaethics. Ethics 77, 95–106 (1967).


  51. Katzenstein, P. et al. The Culture of National Security: Norms and Identity in World Politics. New Directions in World Politics (Columbia Univ. Press, 1996).

  52. Lindström, B., Jangard, S., Selbing, I. & Olsson, A. The role of a ‘common is moral’ heuristic in the stability and change of moral norms. J. Exp. Psychol. 147, 228–242 (2018).


  53. Hendrycks, D. et al. Aligning AI with shared human values. In Proc. Int. Conf. Learning Representations (OpenReview.net, 2021).

  54. Reif, E. et al. Visualizing and measuring the geometry of BERT. In Proc. Annu. Conf. Neural Information Processing Systems 8592–8600 (2019).

  55. Chen, B. et al. Probing BERT in hyperbolic spaces. In 9th Int. Conf. Learning Representations (2021).

  56. Chami, I., Gu, A., Nguyen, D. & Ré, C. HoroPCA: hyperbolic dimensionality reduction via horospherical projections. In Proc. 38th Int. Conf. Machine Learning (2021).

  57. Kurita, K., Vyas, N., Pareek, A., Black, A. W. & Tsvetkov, Y. Measuring bias in contextualized word representations. In Proc. First Workshop on Gender Bias in Natural Language Processing 166–172 (Association for Computational Linguistics, 2019).

  58. Tan, Y. C. & Celis, L. E. Assessing social and intersectional biases in contextualized word representations. In Proc. Advances in Neural Information Processing Systems 32: Annu. Conf. Neural Information Processing Systems (eds Wallach, H. M. et al.) 13209–13220 (2019).

  59. Zhang, Z. et al. Semantics-aware BERT for language understanding. In Proc. 34th AAAI Conference on Artificial Intelligence 9628–9635 (AAAI Press, 2020).

  60. Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at https://arxiv.org/abs/1909.05858 (2019).


Acknowledgements

The authors thank the anonymous reviewers for their valuable feedback. They also thank Aleph Alpha for very useful feedback and access to the GPT-3 API. This work benefited from the ICT-48 Network of AI Research Excellence Center ‘TAILOR’ (EU Horizon 2020, grant agreement no. 952215) (K.K.), the Hessian research priority programme LOEWE within the project WhiteBox (K.K. and C.R.), and the Hessian Ministry of Higher Education, Research and the Arts (HMWK) cluster projects ‘The Adaptive Mind’ (K.K. and C.R.) and ‘The Third Wave of AI’ (K.K., C.R. and P.S.).

Author information


Contributions

P.S. and C.T. contributed equally to the work. P.S., C.T. and K.K. designed the study. P.S., C.T., C.R. and K.K. interpreted the data and drafted the manuscript. C.T. and N.A. designed the user study. C.T. performed and analysed the user study. P.S. performed and analysed the text generation study. C.R. and K.K. directed the research and gave initial input. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Patrick Schramowski or Cigdem Turan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 BERT has a moral direction.

The direction is defined by a PCA computed on BERT-based sentence embeddings. The top PC, the moral direction m, divides the x axis into Dos and Don’ts. The displayed verbs were used to compute the PCA.

Extended Data Fig. 2 Overview of the methods applied to investigate the moral values and norms mirrored by LMs.

(a) The LAMA framework with a prompt designed to analyse the moral values mirrored by the LM. (b) The question-answering approach and (c) our proposed MD approach. The BERT module is a placeholder for the LM.
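As a rough illustration of the question-answering probe in panel (b), one can compare the embedding similarity of a templated moral question to a 'yes' and a 'no' answer. The templates and the simple cosine difference below are assumptions for illustration (reusing the sentence-embedding model from the sketch after the abstract), not the exact prompts or scoring used in the study.

```python
# Sketch: question-answering probe comparing a moral question to templated answers.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def qa_bias(action: str) -> float:
    # Embed the question together with an affirmative and a negative answer.
    q, yes, no = model.encode([f"Should I {action}?",
                               f"Yes, I should {action}.",
                               f"No, I should not {action}."])
    # Positive values indicate the embedding space leans towards the affirmative answer.
    return cosine(q, yes) - cosine(q, no)

print(qa_bias("help people"), qa_bias("kill people"))
```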

Extended Data Fig. 3 Overview of the participants of the AMT user study.

(a) The participants’ locations grouped by country and continent. (b) The age distribution and (c) the gender distribution. In total, 234 volunteers participated in the study.

Supplementary information

Supplementary Information

Supplementary Sections A–D, Tables 1–7 and Figs 1–3.

Reporting Summary


About this article


Cite this article

Schramowski, P., Turan, C., Andersen, N. et al. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nat Mach Intell 4, 258–268 (2022). https://doi.org/10.1038/s42256-022-00458-8

