Artificial writing is permeating our lives owing to recent advances in large-scale, transformer-based language models (LMs) such as BERT, GPT-2 and GPT-3. Using these as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many natural language processing tasks and shown that they capture not only linguistic knowledge but also general knowledge implicitly present in the training data. Unfortunately, LMs trained on unfiltered text corpora also exhibit degenerate and biased behaviour. While this is well established, we show here that recent LMs additionally contain human-like biases about what is right and wrong to do, reflecting existing ethical and moral norms of society. We show that these norms can be captured geometrically by a ‘moral direction’ in the embedding space, which can be computed, for example, by principal component analysis (PCA). Without any explicit training for this task, the computed ‘moral direction’ rates the normativity (or non-normativity) of arbitrary phrases in close agreement with social norms. We demonstrate that the ‘moral direction’ can provide a path for attenuating or even preventing toxic degeneration in LMs, showcasing this capability on the RealToxicityPrompts testbed.
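The geometric idea described above can be sketched in a few lines of code. The following is a minimal illustration only: the toy vectors stand in for BERT sentence embeddings of ‘Do’ and ‘Don’t’ phrases, and the function names are hypothetical; the authors’ actual implementation is in the MoRT_NMI repository linked below.

```python
import numpy as np

def moral_direction(pos_emb, neg_emb):
    """Top principal component of 'Do'/'Don't' phrase embeddings,
    oriented so that normative ('Do') phrases score positive."""
    X = np.vstack([pos_emb, neg_emb])
    mean = X.mean(axis=0)
    # PCA via SVD on the centred embedding matrix
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    m = vt[0]
    # fix the sign of the principal component
    if (pos_emb.mean(axis=0) - mean) @ m < 0:
        m = -m
    return m, mean

def normativity_score(emb, m, mean):
    """Project a phrase embedding onto the moral direction m."""
    return (emb - mean) @ m

# Toy stand-in vectors; in the paper these would be BERT sentence
# embeddings of phrases such as 'help people' vs 'kill people'.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
dos = base + np.eye(8)[0] + rng.normal(scale=0.1, size=(5, 8))
donts = base - np.eye(8)[0] + rng.normal(scale=0.1, size=(5, 8))

m, mean = moral_direction(dos, donts)
print(normativity_score(dos[0], m, mean))    # positive for 'Do'-like phrases
print(normativity_score(donts[0], m, mean))  # negative for 'Don't'-like phrases
```

Once the direction is fixed, any new phrase embedding can be scored by this same projection, which is what allows the method to rate arbitrary phrases without task-specific training.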
Open Access articles citing this article.
Ethical and methodological challenges in building morally informed AI systems
AI and Ethics Open Access 22 June 2022
The user study data are available at the code repository https://github.com/ml-research/MoRT_NMI/tree/master/Supplemental_Material/UserStudy. The text generated using the presented approach is available at https://hessenbox.tu-darmstadt.de/public?folderID=MjR2QVhvQmc0blFpdWd1YjViNHpz. The RealToxicityPrompts data are available at https://allenai.org/data/real-toxicity-prompts/.
The code to reproduce the figures and results of this article, including pre-trained models, can be found at https://github.com/ml-research/MoRT_NMI (archived at https://doi.org/10.5281/zenodo.5906596).
Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4171–4186 (2019).
Peters, M. E. et al. Deep contextualized word representations. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Walker, M. A., Ji, H. & Stent, A.) 2227–2237 (Association for Computational Linguistics, 2018).
Yang, Z. et al. XLNet: generalized autoregressive pretraining for language understanding. In Adv. Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS) (eds Wallach, H. M. et al.) 5754–5764 (2019).
Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems (NeurIPS) (eds. Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H.) (2020).
Next chapter in artificial writing. Nat. Mach. Intell. 2, 419 (2020).
Goldberg, Y. Assessing BERT’s syntactic abilities. Preprint at https://arxiv.org/abs/1901.05287 (2019).
Lin, Y., Tan, Y. & Frank, R. Open Sesame: getting inside BERT's linguistic knowledge. In Proc. 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP 241–253 (2019).
Reif, E. et al. Visualizing and measuring the geometry of BERT. In Adv. Neural Information Processing Systems 32: Annu. Conf. Neural Information Processing Systems (eds. Wallach, H. M. et al.) 8592–8600 (2019).
Shwartz, V. & Dagan, I. Still a pain in the neck: Evaluating text representations on lexical composition. Trans. Assoc. Comput. Linguistics 7, 403–419 (2019).
Tenney, I. et al. What do you learn from context? probing for sentence structure in contextualized word representations. In Proc. 7th International Conference on Learning Representations (OpenReview.net, 2019).
Talmor, A., Elazar, Y., Goldberg, Y. & Berant, J. oLMpics: on what language model pre-training captures. Trans. Assoc. Computational Linguistics 8, 743–758 (2020).
Roberts, A., Raffel, C. & Shazeer, N. How much knowledge can you pack into the parameters of a language model? In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 5418–5426 (Association for Computational Linguistics, 2020).
Petroni, F. et al. Language models as knowledge bases? In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds. Inui, K., Jiang, J., Ng, V. & Wan, X.) 2463–2473 (Association for Computational Linguistics, 2019).
Doctor GPT-3: hype or reality? Nabla https://www.nabla.com/blog/gpt-3/ (Accessed 28 February 2021).
Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. RealToxicityPrompts: evaluating neural toxic degeneration in language models. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: Findings (eds. Cohn, T., He, Y. & Liu, Y.) 3356–3369 (Association for Computational Linguistics, 2020).
Abid, A., Farooqi, M. & Zou, J. Persistent anti-Muslim bias in large language models. In Proc. AAAI/ACM Conference on AI, Ethics, and Society 298–306 (Association for Computing Machinery, 2021).
Microsoft’s racist chatbot revealed the dangers of online conversation. IEEE Spectrum https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/in-2016-microsofts-racist-chatbot-revealed-the-dangers-of-online-conversation (25 November 2019).
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proc. ACM Conference on Fairness, Accountability, and Transparency (eds. Elish, M. C., Isaac, W. & Zemel, R. S.) 610–623 (2021).
Hutson, M. Robo-writers: the rise and risks of language-generating AI. Nature 591, 22–56 (2021).
Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).
Jentzsch, S., Schramowski, P., Rothkopf, C. A. & Kersting, K. Semantics derived automatically from language corpora contain human-like moral choices. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES) 37–44 (2019).
Schramowski, P., Turan, C., Jentzsch, S., Rothkopf, C. A. & Kersting, K. The moral choice machine. Front. Artif. Intell. 3, 36 (2020).
Churchland, P. Conscience: The Origins of Moral Intuition (W. W. Norton, 2019).
Christakis, N. A. The neurobiology of conscience. Nature 569, 627–628 (2019).
Gert, B. & Gert, J. In The Stanford Encyclopedia of Philosophy Fall 2020 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2020).
Alexander, L. & Moore, M. In The Stanford Encyclopedia of Philosophy Summer 2021 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2021).
Bicchieri, C., Muldoon, R. & Sontuoso, A. In The Stanford Encyclopedia of Philosophy Winter 2018 edn (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford University, 2018).
Bolukbasi, T., Chang, K., Zou, J. Y., Saligrama, V. & Kalai, A. T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. Neural Information Processing Systems 4349–4357 (Curran Associates, 2016).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing (2019).
Cer, D. et al. Universal sentence encoder for English. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds. Blanco, E. & Lu, W.) 169–174 (Association for Computational Linguistics, 2018).
Radford, A. et al. Language models are unsupervised multitask learners. Technical report, OpenAI (2019).
Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds. Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J. R.) 8342–8360 (Association for Computational Linguistics, 2020).
Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. In Proc. 8th International Conference on Learning Representations (OpenReview.net, 2020).
Peng, X., Li, S., Frazier, S. & Riedl, M. Reducing non-normative text generation from language models. In Proc. 13th International Conference on Natural Language Generation 374–383 (Association for Computational Linguistics, 2020).
Chen, M. X. et al. Gmail Smart Compose: real-time assisted writing. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (eds. Teredesai, A. et al.) 2287–2295 (ACM, 2019).
GPT-3 Powers the Next Generation of Apps. OpenAI https://openai.com/blog/gpt-3-apps/ (Accessed 22 January 2022).
Forbes, M., Hwang, J. D., Shwartz, V., Sap, M. & Choi, Y. Social chemistry 101: Learning to reason about social and moral norms. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 653–670 (Association for Computational Linguistics, 2020).
Ross, A. S., Hughes, M. C. & Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proc. International Joint Conference on Artificial Intelligence 2662–2670 (2017).
Teso, S. & Kersting, K. Explanatory interactive machine learning. In Proc. AAAI/ACM Conference on AI, Ethics, and Society (2019).
Schramowski, P. et al. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nat. Mach. Intell. 2, 476–486 (2020).
Berreby, F., Bourgne, G. & Ganascia, J.-G. Modelling moral reasoning and ethical responsibility with logic programming. In Logic for Programming, Artificial Intelligence, and Reasoning (eds. Davis, M., Fehnker, A., McIver, A. & Voronkov, A.) 532–548 (Springer, 2015).
Pereira, L. M. & Saptawijaya, A. Modelling morality with prospective logic. Int. J. Reason. Based Intell. Syst. 1, 209–221 (2009).
Levine, S., Kleiman-Weiner, M., Schulz, L., Tenenbaum, J. & Cushman, F. The logic of universalization guides moral judgment. Proc. Natl Acad. Sci. USA 117, 26158–26169 (2020).
Turney, P. D. & Pantel, P. From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Neural Information Processing Systems 3111–3119 (2013).
Conneau, A., Kiela, D., Schwenk, H., Barrault, L. & Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In Proc. 2017 Conference on Empirical Methods in Natural Language Processing 670–680 (2017).
Zhu, Y. et al. Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE Int. Conf. Computer Vision 19–27 (IEEE Computer Society, 2015).
Shafer-Landau, R. Ethical Theory: An Anthology Vol. 13 (John Wiley & Sons, 2012).
Fassin, D. A Companion to Moral Anthropology (Wiley Online Library, 2012).
Sumner, L. W. Normative ethics and metaethics. Ethics 77, 95–106 (1967).
Katzenstein, P. et al. The Culture of National Security: Norms and Identity in World Politics. New Directions in World Politics (Columbia Univ. Press, 1996).
Lindström, B., Jangard, S., Selbing, I. & Olsson, A. The role of a ‘common is moral’ heuristic in the stability and change of moral norms. J. Exp. Psychol. 147, 228–242 (2018).
Hendrycks, D. et al. Aligning AI with shared human values. In Proc. Int. Conf. Learning Representations (OpenReview.net, 2021).
Chen, B. et al. Probing BERT in hyperbolic spaces. In 9th Int. Conf. Learning Representations (2021).
Chami, I., Gu, A., Nguyen, D. & Ré, C. HoroPCA: hyperbolic dimensionality reduction via horospherical projections. In Proc. 38th Int. Conf. Machine Learning (2021).
Kurita, K., Vyas, N., Pareek, A., Black, A. W. & Tsvetkov, Y. Measuring bias in contextualized word representations. In Proc. First Workshop on Gender Bias in Natural Language Processing 166–172 (Association for Computational Linguistics, 2019).
Tan, Y. C. & Celis, L. E. Assessing social and intersectional biases in contextualized word representations. In Proc. Advances in Neural Information Processing Systems 32: Annu. Conf. Neural Information Processing Systems (eds Wallach, H. M. et al.) 13209–13220 (2019).
Zhang, Z. et al. Semantics-aware BERT for language understanding. In Proc. 34th AAAI Conference on Artificial Intelligence 9628–9635 (AAAI Press, 2020).
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at https://arxiv.org/abs/1909.05858 (2019).
The authors thank the anonymous reviewers for their valuable feedback. The authors also thank Aleph Alpha for very useful feedback and access to the GPT-3 API. This work benefited from the ICT-48 Network of AI Research Excellence Center ‘TAILOR’ (EU Horizon 2020, grant agreement no. 952215) (K.K.), the Hessian research priority programme LOEWE within the project WhiteBox (K.K. and C.R.), and the Hessian Ministry of Higher Education, Research and the Arts (HMWK) cluster projects ‘The Adaptive Mind’ (K.K. and C.R.) and ‘The Third Wave of AI’ (K.K., C.R. and P.S.).
The authors declare no competing interests.
Peer review information
Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 BERT has a moral direction.
The direction is defined by a PCA computed on BERT-based sentence embeddings. The top principal component, the moral direction m, divides the x axis into Dos and Don’ts. The displayed verbs were used to compute the PCA.
Extended Data Fig. 2 Overview of methods applied to investigate the moral values and norms mirrored by LMs.
(a) The LAMA framework with a prompt designed to analyse the moral values mirrored by the LM. (b) The question-answering approach and (c) our proposed MD approach. The BERT module is a placeholder for the LM.
Extended Data Fig. 3 Overview of participants of AMT user study.
(a) The participants’ locations grouped by country and continent. (b) The age distribution and (c) the gender distribution. In total, 234 volunteers participated in the study.
Supplementary Sections A–D, Tables 1–7 and Figs 1–3.
Cite this article
Schramowski, P., Turan, C., Andersen, N. et al. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nat. Mach. Intell. 4, 258–268 (2022). https://doi.org/10.1038/s42256-022-00458-8