AIs built on Large Language Models have wowed by producing particularly fluent text. However, their ability to do this is limited in many languages. As the data and resources used to train a model in a specific language drop, so does the performance of the model, meaning that for some languages the AIs are effectively useless.
Why ChatGPT can't handle some languages
Researchers are aware of this problem and are trying to find solutions, but the challenge extends far beyond just the technical, with moral and social questions to be answered. This podcast explores how Large Language Models could be improved in more languages and the issues that could be caused if they are not.
TRANSCRIPT
Nick Petrić Howe
ChatGPT has a language problem.
Asmelash Teka Hadgu
So my native language is called Tigrinya. It's spoken in Tigray in Ethiopia, and also in Eritrea.
Nick Petrić Howe
This is Asmelash Teka Hadgu, an AI researcher at the Distributed AI Research Institute, and founder of a language technology startup focused on African Languages. In the past he and his colleagues have looked at how ChatGPT performs for certain African languages, like Tigrinya.
Asmelash Teka Hadgu
And we ran a series of very small experiments and what we discovered was basically, at least for these languages, the result is total gibberish.
Nick Petrić Howe
To probe deeper, I asked Asmelash to try out some prompts for me. I should mention that he was using the more advanced, paid-for, GPT-4-powered version of ChatGPT as well. Here’s how ChatGPT responded when asked to list sports in Tigrinya.
Asmelash Teka Hadgu
If you are a speaker of Tigrinya and if you look at these things, there are some Tigrinya things: <Tigrinyan speech>. If you asked me what this sentence means, I would have a hard time telling you anything. And in particular here for example, football/soccer here it says <Tigrinyan speech> literally it says, ‘leg mind’. That's not soccer, soccer is <Tigrinyan speech>.
Nick Petrić Howe
Now I’ve heard of keeping your head in the game in football, but “leg mind” is probably a bit of a stretch. In the end, Asmelash tried around 30 different prompts for me in Tigrinya and he found similar oddities throughout.
Asmelash Teka Hadgu
Here, the prompt was like list examples of European countries <Tigrinyan speech>. So, it has the pattern of just repeating anything you give it here. So whatever is there, so for example, <Tigrinyan speech> it was in the thing, so whatever was on the prompt, it kind of repeats these words. So we have a mention of sport here even though we're not talking about sport. Remember I said <Tigrinyan speech> before, like leg mind, you know, made up words that doesn't exist in the language. It appears here with reference to Germany.
Nick Petrić Howe
Only one prompt he tried returned a useful answer to Asmelash. When presented with Tigrinya text, ChatGPT correctly identified it as Tigrinya, but for all the other prompts, when it was trying to respond in Tigrinya, it generated completely nonsensical answers. Unsurprisingly, Asmelash didn’t think it would be useful for speakers of Tigrinya.
Asmelash Teka Hadgu
It would be a waste of your time. I mean, it's just generating nonsense. So, I wouldn't recommend anyone using ChatGPT to do anything useful. Because it doesn’t at this moment, for these languages.
<music>
Nick Petrić Howe
ChatGPT and chatbots like it are taking the world by storm. But if you’re not using English, chances are that you are having a worse experience. And that is a problem. You might think of these technologies as just tools unaffected by our unobjective humanness, but in reality they are anything but. The ‘language’ part of the large language model fundamentally impacts how they can function, because language is intrinsically linked with culture, history, belief, identity, wellbeing, inequity and progress. And as these chatbots become more and more integrated in our lives, we need to confront some uncomfortable truths - about how we design this technology, how it could be used and the impacts it could have. In this podcast we are going to explore the relationship between LLMs and languages and ask what must be done to ensure that AIs work for everyone.
Pascale Fung
I think everybody in the community agrees that we need to democratise AI. There should not be disproportionate benefit in one language versus the other, right. So we want fair access, you know, and we want to empower communities in different languages.
<music>
Nick Petrić Howe
The problem Asmelash describes at the start of this podcast is well known in the field of Large Language Models or LLMs, the models that form the basis of chatbots like ChatGPT. Tigrinya is what is referred to as a ‘low-resource language’, a definition used in the field that doesn’t necessarily reflect how widely spoken a language is, but instead has more to do with how much high-quality data is available to train an LLM in that language. For example, Bengali is often considered a low-resource language, even though it has hundreds of millions of speakers. It just doesn’t have as much of the high-quality data scraped from the internet as English does. In general, the lower-resource the language, the worse the performance. This means that ChatGPT will do amazingly well in English, the highest-resource language (by quite a margin), but poorly for other languages. Now it is not that the big chatbots don’t include languages other than English. ChatGPT, for instance, will boldly state its abilities in other languages and respond to prompts in those languages. It’s just not been the focus, and the performance reflects that.
Thien Nguyen
The problem here is that when people start using this model in different writing tasks, they tend to adopt the bias, they tend to adopt the kind of styles that English speaker would have when they think about or write about this problem.
Nick Petrić Howe
Thien Nguyen, an AI researcher from the University of Oregon, worries that the different levels of linguistic ability in chatbots could be causing a loss of diversity of thought, something which could be of big consequence to science.
Thien Nguyen
And I think because of these kinds of standardisation, so using English, for different tasks, we somehow miss the innovations, we somehow miss the diverse interpretations of the same scientific results. When we go into the publications, I think there's tremendous value in having a foreign speaker to write about some findings, in English, using their own style, in their own perception, using their own identities, using their own kind of thinking and mindset. And that basically provides a lot of benefits for the scientific discovery because we have a diverse views, about different topics.
Nick Petrić Howe
LLMs are already used quite a lot in science to help with writing, creating code or even as a digital secretary. ChatGPT and its LLM cousins were rated the most impressive or useful examples of AI in a Nature survey. One of the things that researchers thought would be the most useful would be to help people whose first language isn’t English. And whilst chatbots can help here, as Thien argues, you could miss out on a wider diversity of thought if things are only viewed through an English lens. Now, there are technical solutions to building LLMs in different languages. The obvious one is also one of the hardest to achieve: getting more data to train the models. Given how much of the internet is devoted to English, there often just isn’t the data there to access.
Thien Nguyen
The amount of data that you can collect for these different language wouldn't be enough to train large language models to perform well. And the reason is, because again, we already have the dominance of English in our current publication. So even when you try to go to internet and you try to collect as much data as you can, in different sources, you actually wouldn't be able to find sufficient training data. And I mean, we did some of their large-scale data collection efforts recently for like, more than 150 language. And basically, you can see that between English and the second largest language in our layer is like a big gap. So I think just going out there and collecting more datas for these different language wouldn't be the sustainable solution to train, like, high performance, large language model.
Nick Petrić Howe
Even in English, AI companies are already starting to struggle to get enough data to train these hungry models. In fact, there’s growing controversy over the potential use of data scraped from YouTube and other copyrighted sources to train models, a step allegedly taken because the companies had already burned through all the useful data. So what is the alternative? Well, Thien suggests that some of the existing English large language models could be leveraged to build models in other languages. This is a process that’s known as transfer learning.
Thien Nguyen
I think the way to go is actually to think about how we transfer the knowledge from English to a different language, so that we don't need to use that much training data for this new language and still have a good performance. And here, basically, you're thinking about starting with some of the existing language model that we have for English, for instance. And then we try to adapt that to a new language by introducing transfer learning techniques, and leveraging just some reasonable amounts of data for the target language.
Nick Petrić Howe
The idea here is to start with an English model that has information in it that applies to all languages and then mould it into one that can power a chatbot in your target language. For example, there’ll be linguistic information in the English model, like language is composed of words and sentences which convey meaning. But there’ll also be the sort of knowledge you want to build a helpful chatbot: information about science, facts and history that would be applicable to everyone. As there’s so much English data out there that these models have been trained on, it makes sense to utilise that, rather than training a new model from scratch in the new language. With all this information you’d then train and fine-tune that model using some data for your target language. Here, you’d be wanting to train it to understand the meaning of words in the new language and to give it the sort of cultural information that would be useful. For example, if you were building a model for Vietnamese, you’d want to train it not only with Vietnamese language but also with Vietnamese history and cultural references. While this is technically possible, transfer learning techniques do still require quite a bit of training data, which means it’s unclear whether they will work for languages that are very low resource. But it still can be effective. In fact, many researchers have done this for different languages, including Thien, who’s made an LLM for Vietnamese. And he thinks this new model works much better than ChatGPT in this language.
Thien Nguyen
It understand the slang in Vietnamese much better than that in English, right? Because this is built based on the data that we have for Vietnamese. And certainly it can answer questions for Vietnamese histories, Vietnamese geography much better than ChatGPT, because actually it has a local knowledge about the countries and the people. It's more like a native speaker in Vietnamese.
<music>
Nick Petrić Howe
But even assessing how well a model works in different languages is a challenge, and the English bias is central to that problem too. Scientists use standardised questions called benchmarks to quantify the performance of their models. And yet…
Alham Fikri Aji
Majority of benchmark data is also in English.
<music>
Nick Petrić Howe
Alham Fikri Aji is from the Mohamed bin Zayed University of Artificial Intelligence. He studies multilingual large language models and told me about the issues of benchmarks.
Alham Fikri Aji
So for example, common benchmark data is called MMLU. This is like science questions, right? Or there’s like, called GSM8K, this is also a benchmark data, like math questions. So you have these types of questions. But again, the same problem. They are in English.
Nick Petrić Howe
These benchmarks are often questions that probe at an LLM’s abilities. They range from logic puzzles to high school biology to macroeconomics. For example, one question on the MMLU that Alham mentioned is simply: “As of 2017, how many of the world’s 1-year-old children today have been vaccinated against some disease?” The LLM is offered four options: 80%, 60%, 40% and 20%, with the correct answer being 80%.
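As an editorial illustration, the scoring behind a multiple-choice benchmark like this can be sketched in a few lines of Python. This is a minimal sketch and not the actual MMLU evaluation code: the `MultipleChoiceItem` class and the `accuracy`, `always_first_option` and `always_last_option` functions are made-up names, the option order follows the podcast rather than the benchmark file, and a real evaluation would replace the stand-in functions with calls to an actual model.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """One benchmark question with its options and the index of the right one."""
    question: str
    options: list[str]
    answer_index: int


def accuracy(items, choose):
    """Fraction of items for which `choose` returns the correct option index."""
    if not items:
        raise ValueError("need at least one benchmark item")
    correct = sum(1 for item in items if choose(item) == item.answer_index)
    return correct / len(items)


# The vaccination question quoted in the transcript, options in the order read out.
vaccination_item = MultipleChoiceItem(
    question=(
        "As of 2017, how many of the world's 1-year-old children today "
        "have been vaccinated against some disease?"
    ),
    options=["80%", "60%", "40%", "20%"],
    answer_index=0,
)


def always_first_option(item):
    # Hypothetical stand-in for a model: always answers with the first option.
    return 0


def always_last_option(item):
    # Hypothetical stand-in for a model: always answers with the last option.
    return len(item.options) - 1


print(accuracy([vaccination_item], always_first_option))  # 1.0
print(accuracy([vaccination_item], always_last_option))   # 0.0
```

A score like this only says how often the chosen option matches the answer key, so it inherits whatever language and cultural assumptions are built into the questions themselves.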
Alham Fikri Aji
Now, how do people benchmark multilingual data? They take this benchmark, and then they usually translate them. So we have this multilingual benchmark.
Nick Petrić Howe
This translation can cause problems as there’s never a perfect translation from one language to another. Translation itself is a bit of an art form. Words don’t always have a one-to-one translation from English; they can depend on context or may be untranslatable altogether. The previous example was pretty straightforward, but some of the other questions on the benchmark are far more complicated. They involve lots of historical context or complicated scientific terminology that may not have an equivalent in the target language. Through translation as well, you may get other biases and things tangled up in there depending on how you interpret the text. Many of the questions are pretty US-centric too. For example, one question is: “Which of the following best states an argument made by James Madison in the Federalist number 10?” The correct answer is of course “The negative effects of factionalism can be reduced by a republican government”. For people in the US that may be relatively obvious, but for me, a Brit, despite speaking English as my first language, I had no idea for this one. My knowledge of the Federalist papers only runs as deep as Hamilton.
Alham Fikri Aji
So when we ask, what is the performance of multilingual language model? Frankly, we don't know because the benchmark itself is a problem. You translate the benchmark, it does not cover all languages.
Nick Petrić Howe
Now one solution to this is to allow the people who would use the model to assess it. This is kind of what happened when ChatGPT came out and millions of people started using it, finding out what it could and what it couldn’t do. But if in an ideal world you wanted an LLM that could converse in multiple languages, well that’s a tricky problem to solve. But why would you want a multilingual LLM? Well for a start, many people are multilingual and would benefit from one, or a single model could serve many people from diverse communities, or it could help with translation. Another thing though is that making a model multilingual may actually be a method to improve its performance when training data is lacking. In those cases, Alham’s research has shown that adding more languages to the mix, especially closely related ones, actually makes the LLM better. But that only works when the model isn’t the best to begin with. If you have an LLM that already works very well, for English say, then adding in further languages actually makes it worse.
Alham Fikri Aji
So let's say I train an English only model, it is very good. But then let's say I train now, like multilingual models, including English, when you compare the English one, usually the pure English one is better. It is, actually because English is one of the very high resource languages. However, when we talk about the low resource one, they are actually getting a benefit of having other languages there. Let's say maybe Sundanese is a very low resource language, let's say if you train Sundanese only model, it's usually not as good as if you train multilingual models, including Sundanese. So it's kind of like balanced out, I guess, the performance, right. So you improve the low resource performance by sacrificing the high resource ones.
Nick Petrić Howe
It’s unclear why this is. It’s possible that introducing more languages into a data set could help build models for very low-resource languages. But let’s say that one, or all, or a combination of these technical solutions are used, and you end up with a well-performing language model. Now you reach a very tricky part of the LLM’s development: getting humans involved.
<music>
Pascale Fung
There's a large part of the work is called something called reinforcement learning with human feedback.
Nick Petrić Howe
This is Pascale Fung. She studies conversational AI at the Hong Kong University of Science and Technology.
Pascale Fung
And this is what got ChatGPT to perform really well compared to its predecessor, GPT3. And this process, this reinforcement learning with human feedback requires a lot of human feedback. So that includes both user feedback, and also people they hire, to give feedback to rate the answers and to show the preferences.
Nick Petrić Howe
And given that lots of these models are made to work in English, so too is lots of the human-reinforcement.
Pascale Fung
And this is a very expensive process, you know, you hire people to do it. So, most of the LLM today, you know, out of American companies tend to focus that process in English.
Nick Petrić Howe
And herein lies a problem. Humans don’t really agree on anything, so how do you build an LLM that works for everyone?
Pascale Fung
So human preference is not just linguistic, right? Because human preferences reflect of each language is, is basically the value of that culture. So it's not just a linguistic issue is a cultural issue. And more important, is the value issue.
<music>
Pascale Fung
So today, for example, LLM can translate very well. So should we just use all the English values and translate to our languages? That's obviously not a good approach, right? Because, you know, different cultures might have different values.
Nick Petrić Howe
Language in many ways is a reflection of our cultures, so simple translation can be problematic. For example, a dove is often thought of as a symbol of peace, but the Basque word for dove is an insult. A simple, direct translation in the context of, say, a peace negotiation, could quickly become quite problematic.
Even within the same language, it’s not exactly like our values align. In English, the word abortion to one person can mean a medical procedure, to another it can mean murder. Getting humans to agree on values is a challenge that’s faced us ever since we started using languages and there are no easy solutions to it, for Large Language Models or otherwise. But in an increasingly globalised world, with geopolitical tensions building - the cultural concerns here extend beyond translation mishaps and cultural misunderstandings.
OpenAI makes no secret of the fact that its models have a US-centric point of view. And a recent study has shown that, indeed, ChatGPT promotes such values. In the study the researchers probed ChatGPT using something called the Hofstede Cultural Survey, a commonly used tool to analyse alignment with culture that assesses things like individualism, masculinity and indulgence. They found that ChatGPT had the strongest correlation with US culture, even when asked about the viewpoints of people from other parts of the world. And that means that performance and accuracy are not the only concerns. For example, ChatGPT actually performs quite well in Mandarin due to a relatively large amount of training data available, but in China ChatGPT is censored. There are concerns that ChatGPT could act like US propaganda. And so, instead of relying on such ‘American’ models many Chinese companies are focusing on developing their own LLMs.
Jeff Ding
There is this push from the Chinese side to try to develop their own independent AI models.
Nick Petrić Howe
This is Jeff Ding, who studies emerging Chinese technologies at The George Washington University, in the US.
Jeff Ding
It's not just about being independent from a material standpoint. And from a strategic standpoint, I think there's also some level of being independent in this cutting-edge technology as a way to show that you have soft power, you have this strong reputation as a scientific leader.
Nick Petrić Howe
However, despite such ambitions, the models developed in China are currently some way behind the big players in the US. Growing tensions have meant the US is restricting access to resources Chinese competitors could use, but it’s also due to cultural differences, like Chinese censorship. LLMs need massive amounts of data to learn from, but the Chinese authorities have proposed restrictions on what data can be used and each would-be LLM needs approval before it can be released. And that has meant that Chinese models have taken a fair bit longer to get out of the gate.
Jeff Ding
For me, the main issue posed by censorship was it really slowed down the ability of Chinese LLM developers to release their models out into the general public.
Nick Petrić Howe
To prevent LLMs learning a view of the world which didn’t align with the goals of the censors, legislation has been introduced to strictly regulate LLM research and use.
Jeff Ding
And there were these interim generative AI regulations put in place that impose very, very strict regulations on any Chinese LLMs that were aimed at the entire public or had, quote, public opinion properties, unquote. So that instituted around a year delay in terms of the ability of Chinese LLM providers to release their models out into the wild. ChatGPT and other Western models got so much of an advantage from that, because they could benefit from everybody testing and playing around with their models. They also generated a huge amount of demand for their models, and a huge amount of recognition. So I see that as the Chinese government's concerns about censorship, slowing down the release of these models out in the wild, created a huge delay in terms of Chinese labs' ability to compete.
Nick Petrić Howe
Now this hasn’t stopped models being developed in China. Baidu, a huge technology company, has developed ERNIE, which the company says has attracted 200 million users. However, it’s focused on the Chinese market and abides by Chinese regulations. Try to ask certain questions that would touch on censored information and it will likely say, “let’s talk about something else”. Whilst that may seem a bit odd, it is not unusual for other chatbots either; it’s just the motivation and context that change. Ask ChatGPT or other chatbots to generate hate speech and they will (likely) refuse. But it could limit the utility of such a model for people who don’t want such censorship in what they’re generating.
<music>
Nick Petrić Howe
The point is that building a chatbot that works well in Chinese languages is not just a case of accessing the right training data, or benchmarking data, or training protocols. To create a chatbot that works well in Chinese languages, you need to think about and define what ‘working well in Chinese’ actually means culturally: what it means to the Chinese people, the Chinese government and in each use case. The fact is that the goalposts can move, and languages are at the heart of the problem.
<music>
Nick Petrić Howe
Ultimately, Jeff believes that China and Chinese companies will want to build LLMs that can be used worldwide, and that can compete with the other big players. Right now though, domestic concerns are leading them to focus on the Chinese market. Meanwhile there are many other nations, cultures and languages that want LLMs now, and would be willing to import them, but not from the US. Their cultural sensitivities may not align, for example. Even in places that seem fairly aligned like the EU, there are worries that US-based AIs could overwhelm local cultures and languages, and the two regions have very different approaches to how such AI should be regulated.
Yong Lim
So there's that sort of fear that AI technology if you don't make it right, if you don't do it properly, you're going to be sort of beholden to countries or governments that have better technology or industry to work with.
Nick Petrić Howe
This is Yong Lim, director of Seoul National University’s AI initiative. As we have heard, this kind of fear has driven several regions, like China and the EU, to build their own LLMs separate from US influence. But Yong says that Korean companies have seen another opportunity. What about places in the world which may have a desire for their own LLMs free of US influence, but without the infrastructure or resources to build such models? Well it is here that he says Korean companies are making a play.
Yong Lim
You know, Korean companies have realised that it's very difficult to simply compete based on the size of the model itself, or general LLMs, in that sense. So they've sort of made a play to the market saying that, well, we're going to try to create LLMs that are more refined, or fine-tuned towards your needs.
Nick Petrić Howe
For example, positioning themselves outside of the US and Chinese hegemony, Korean tech giant Naver has teamed up with Aramco, a large Saudi Arabian oil firm, to create LLMs for Arabic languages. There are also Korean companies that are looking to build LLMs for Brazil, the Philippines and Malaysia. Korea has one of the highest rates of internet access and usage in the world, meaning that a huge amount of high-quality data is being generated by its well-connected population, perfect for training a would-be Korean chatbot. And several have been built that seem to work very well, despite some of the complexities of the language. So the idea that some Korean companies are pushing is that if we can do this for Korean, why not for your language or your culture?
Yong Lim
Maybe we'll be able to make it work for you in your language, right? So we're intent on doing that, you know, we're not going to be simply English or Western centric. But look at what we've done, we can make it happen, we've actually are, you know, doing a pretty good job, you know, we may not be at the cutting edge, like a forefront or really frontier. But you know, we're really right there. You know, our models are good enough to work well. And so why don't you give us an opportunity?
Nick Petrić Howe
Now, according to Yong, we don’t yet know if it will ultimately pay off. But it shows how concerns about LLMs being compatible with cultures and language may drive the markets in the future. Whilst US-based companies may continue to focus on building general purpose and widely usable LLMs, companies in other parts of the world may focus on their own market or look to places where LLMs may be desired but cannot easily be built locally.
<music>
Nick Petrić Howe
Pascale and Thien, who you heard from earlier, have been building models for Indonesian and Vietnamese. But their models are open and available to use, rather than being part of a company's strategy, as they are interested in making sure that these technologies are accessible to everyone. This means basing their models on more open models like Mistral 7B and Meta’s LLaMA, and then seeking the involvement of the people who would use them, to make sure that they really reflect their culture. Here’s Pascale.
Pascale Fung
My research group has been collaborating with Indonesian universities and teams to build Indonesian LLMs, in different Indonesian languages. So that, you know, the indigenous community, the indigenous Indonesian speaking populations will have control of the LLM. So it's not just like, we build it, and they use it, but that we build it together with them.
Nick Petrić Howe
And this work is quite fundamental. It isn’t just about the use of culturally specific linguistic features like idioms, or even how words might be interpreted differently in different languages and cultures. No. Pascale’s work goes all the way back to fundamental moral and ethical values.
Pascale Fung
So currently, my team were researching on that question, which is, you know, under what circumstances should there be a value translation? Under what circumstances should there be an indigenous value adoption? Should there be a mixture of both because some values are universal? Maybe we don't need to collect over and over again, values such as, you know, you shouldn't kill somebody for money. But you know, there are other things that are very culturally specific, then we'll need the indigenous community to contribute.
Nick Petrić Howe
Asmelash, who we heard from at the start of this podcast, goes further asking whether LLMs are always the right solution to people’s problems.
Asmelash Teka Hadgu
Is yet another language model what we need? So I would always, I actually, you know, question what kind of technologies we need. I think it's always better to start with the needs of the community than, you know, the technology first, and then kind of like bring it to them. So in, you know, communities like Tigray, there is a very low literacy rate. So people can't, for example, write in their language. So you would imagine that like other technologies, maybe for example, speech recognition would be more useful than large language models.
Nick Petrić Howe
Ultimately, if we view technology as a tool then it should serve people and they should be able to access it - regardless of their background or cultural context. And if Large Language Models are going to play a significant part in that world in the way that their promoters claim, then it is clear that language plays a central part, and so everyone, regardless of their language, should be able to access them.
<music>
Nick Petrić Howe
We need to ensure we don’t allow the incredible performance of LLMs in English to blind us to how well they work elsewhere. How exactly we make these LLMs available in more languages and for more cultures is a challenge for sure, but it’s a solvable one.
<music>
Pascale Fung
I do hope to see a LLM for everyone, that we should not exclude any language. Of course, there are like 6000 languages in the world. Can we get to 6000 languages? Why not? There are scientific solutions, they’re challenges, but if we put our minds to it, collectively, we can do it. And then… well I would say that is not just technical, right? It's socio-technical challenge, right? But, you know, we just starting this conversation, so we need to work on this together. Will there be a perfect solution? Probably not, because human society doesn't have perfect solutions for value alignment, we can’t align values between different societies, but we continue to have the conversation. And for LLMs, I think it is very important that it's open. And that, you know, indigenous communities from all this, you know, speakers of these languages participate, not just in using the LLM, but in building them, fine tuning them, and in doing the human feedback. So I think it should be LLM for everyone, but also everyone on LLM.
<music>
Thien Nguyen
I think large language model will be just a fundamental technologies that becomes a part of our daily life in the future. Because certainly even right now we already see a lot of benefits: it is very helpful for writing assistant, content creation or suggestion. But I think it in the in the longer futures I think these large language model it has to be integrated into diverse modalities and application to be really helpful for people's different ways.
<music>
Alham Fikri Aji
I'm very optimistic, right? Because in terms of, let's say, community, seen more, you know, multilingual community, working on their own regions and their own languages, they are building their own datasets, they are building their own models. So I think in the future, then we can have more support on more languages.
<music>
Asmelash Teka Hadgu
There's a kind of movement that excites me and that I am part of, especially in the African natural language processing space. And this is where like different organisations and startups are coming together in a distributed manner. So we all have like different languages that we work with different communities we serve. But we're also trying to come together and kind of have a unified kind of solution for businesses and people who want to use like the larger African market, that excites me, because I think the technologies each of these organisations, is building is more powerful than, you know, the monolithic, what we know of like the big tech companies are doing. So this is just exciting by itself. But second, they're solving concrete problems for communities they deeply care about. And I think, if that kind of gains traction, and if more creative, and you know, genuinely kind of smart people are involved there. It excites me to see like some positive future, you know, because we're taking this into our own hands, and we're not waiting for somebody else to come in and solve this problem for us.