Scaling neural machine translation to 200 languages

The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world [1]. Focusing on improving the translation quality of a relatively small group of high-resource languages comes at the expense of directing research attention to low-resource languages, exacerbating digital inequities in the long run. To break this pattern, here we introduce No Language Left Behind—a single massively multilingual model that leverages transfer learning across languages. We developed a conditional computational model based on the Sparsely Gated Mixture of Experts architecture [2-7], which we trained on data obtained with new mining techniques tailored for low-resource languages. Furthermore, we devised multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. We evaluated the performance of our model over 40,000 translation directions using tools created specifically for this purpose—an automatic benchmark (FLORES-200), a human evaluation metric (XSTS) and a toxicity detector that covers every language in our model. Compared with the previous state-of-the-art models, our model achieves an average of 44% improvement in translation quality as measured by BLEU. By demonstrating how to scale NMT to 200 languages and making all contributions in this effort freely available for non-commercial use, our work lays important groundwork for the development of a universal translation system.


B NLLB-Seed dataset
Machine learning is notoriously data-hungry, leading to many research areas aimed at reducing the amount of required supervision. Recent advances in zero-shot learning [5, 64, 73, 74] and self-supervised learning [75-77], for instance, seek to reduce this reliance. However, generation tasks like translation are unlikely to reach the desired quality levels without some starter data. For instance, producing a good translation without seeing a minimum number of sentences in a new language is challenging. Similarly, it may be difficult to classify which language a sentence is in without seeing reliable examples of text in different languages. To this end, we create NLLB-Seed, a set of professionally translated sentences in the Wikipedia domain. NLLB-Seed comprises around six thousand sentences in 39 languages.
Such a data set has numerous potential uses. For instance, NLLB-Seed's target-side data in various languages can be deployed for language identification model building. The data set can also be used for its aligned bitext to train, for example, translation models. Another option is to use NLLB-Seed for domain finetuning, such as adapting general-purpose translation models to the Wikipedia domain.

Source sentence selection
Data for NLLB-Seed was sampled from Wikimedia's List of articles every Wikipedia should have, a collection of 10,000 Wikidata IDs corresponding to notable topics in different fields of knowledge and human activity. These are split into 11 categories such as People, History, Philosophy and Religion, and Geography. We uniformly sampled a subset of IDs from which we would draw data and mapped these to the corresponding English Wikipedia articles. From each of these articles, we sampled data that would be sent to translators. Instead of extracting individual sentences, which would have left translators with little context to work with, we chose to sample triplets of contiguous sentences, ensuring no more than one triplet per article was used (similar to Flores-200).
Like Flores-200, NLLB-Seed's source data is English-centric and sampled from English Wikipedia. This has an important effect: the content reflects what Wikipedia editors find relevant for English Wikipedia and likely does not cover a diverse spread of content from different cultures. Furthermore, the target text in NLLB-Seed is ultimately translated by humans and thus potentially contains effects of translationese (often defined as awkward, unnatural, or overly literal translations) [78].

Translation workflow
Script, specification, spelling, and translation approaches were first established against Flores-200. Translators referenced these linguistic alignments while working on seed data translations. The data sets were translated directly from English for 39 languages, while two Arabic script languages (Acehnese and Banjar) and Tamasheq in Tifinagh script were transliterated from their respective Latin script data sets (themselves first translated from English). Translation or transliteration was followed by a linguistic quality assessment phase in which the completed data sets were checked against the linguistic alignments from Flores-200, along with automatic quality control checks.
We note that NLLB-Seed has a key distinction compared to evaluation benchmarks such as Flores-200. Critically, NLLB-Seed is meant to be used for training rather than model evaluation. Due to this difference, NLLB-Seed does not go through the human quality assurance process present in Flores-200.

C Human evaluation details
The final human quality test encompassed a 20% assessment by independent reviewers from a language service provider (LSP). The reviewers assessed translation errors at the sentence level, and the translation quality score per language was determined based on the number of errors identified by the reviewers. The following errors were examined: grammar, punctuation, spelling, capitalization, addition or omission of information, mistranslation, unnatural translation, untranslated text, and register. Each error was also associated with a severity level: minor, major, or critical. The overall score is constructed by tallying these different error types. The acceptable translation quality score was set at 90%. It is also important to note that there was first an initial alignment between the translators and the LSP on the approach to take for each language. In cases of large disagreement, translators were also allowed to arbitrate with the reviewers to further align their understanding of translation quality. This was especially helpful for languages with lower levels of standardization.
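To make the scoring scheme concrete, here is a minimal sketch of how a per-language score could be tallied from severity-weighted errors. The severity weights and per-word normalization are illustrative assumptions, not the LSP's exact formula.

```python
# Hypothetical per-language translation quality score: errors are tallied
# with severity weights and normalized by the number of evaluated words.
# The weights below are illustrative assumptions, not the LSP's values.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def quality_score(errors, num_words):
    """errors: list of (error_type, severity) tuples for one language."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for _, sev in errors)
    # Express the score as a percentage, floored at zero.
    return max(0.0, 100.0 * (1.0 - penalty / num_words))

errors = [("spelling", "minor"), ("mistranslation", "major")]
score = quality_score(errors, num_words=200)
accepted = score >= 90.0  # acceptance threshold used in the review process
```

Under this sketch, a language passes review only if its weighted error rate stays below the 90% threshold described above.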

D Data technical details
To train NLLB-200, we leveraged three different types of bitexts:

Primary bitexts
We use a set of publicly available parallel corpora from a variety of sources, including NLLB-Seed (Appendix I). We added a total of 661 sets of primary bitext data. We chose all English-centric sets when available and also added non-English-centric pairs if they had a low-resource language as source, target, or both. Table 5 provides further information on the list of public bitext corpora we used for training.

Mined bitexts
We used bitext corpora retrieved by large-scale bitext mining, as detailed in Section 2.1.2. We added mined data for a total of 784 directions. These included all English-centric directions and a subset of non-English-centric directions. Non-English-centric mined data effectively improves the performance of multilingual translation systems [1]. However, having 200 languages implies approximately 40,000 non-English-centric pairs, and adding all of them could be detrimental, as some pairs do not have high-quality mined bitexts. To select based on projected quality, we first picked directions with an xsim error rate under 5. As a further restriction, we added mining data primarily for pairs containing low-resource languages within a given language family or geographical region. This is an imperfect approximation to ensure improved transfer learning between similar languages.
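The selection heuristic above can be sketched as follows; the field names and the exact relatedness test are assumptions for illustration, not the actual pipeline implementation.

```python
# Sketch of the direction-selection heuristic for mined bitexts:
# keep directions whose xsim error rate is below 5, and keep
# non-English-centric pairs only when they involve a low-resource
# language from the same family or region. Field names are assumptions.
def select_mined_pair(pair):
    src, tgt = pair["src"], pair["tgt"]
    if pair["xsim_error_rate"] >= 5:
        return False                   # projected mining quality too low
    if "eng" in (src, tgt):
        return True                    # always keep English-centric pairs
    # Otherwise require a low-resource member and language relatedness.
    low_resource = pair["src_low_resource"] or pair["tgt_low_resource"]
    related = pair["same_family"] or pair["same_region"]
    return low_resource and related

pair = {"src": "fuv", "tgt": "hau", "xsim_error_rate": 2.1,
        "src_low_resource": True, "tgt_low_resource": False,
        "same_family": False, "same_region": True}
keep = select_mined_pair(pair)
```

This reproduces the two filters in order: the quality threshold first, then the family/region restriction for non-English-centric pairs.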

Back-translated bitexts
Back-translated data provides a form of weak supervision, which is crucial for improving the translation performance of low-resource languages. Combining back-translated data generated from multiple sources improves the performance of a translation model due to increased back-translation diversity. Following this, we generated back-translated data from two models: (1) a multilingual neural machine translation model (MmtBT) and (2) a set of bilingual statistical machine translation models (SmtBT). We used monolingual data for a total of 192 languages to generate back-translated bitexts.
We share below the full list of bitexts used for training.

D.1 Effect of using different data sources on performance
We expected to see cumulative benefits by combining different sources of data.We empirically explore this hypothesis in this section.

Experimental Setup
We trained dense 3.3B Transformer encoder-decoder models with model dimension 2048, FFN dimension 8192, 16 attention heads, and 48 layers (24 encoder, 24 decoder) for these data ablation experiments. We trained these models on three sets of data: (1) Primary, (2) Primary+Mined, and (3) Primary+Mined+MmtBT+SmtBT to compare the cumulative improvements coming from adding each source of data. All models were trained for a total of 300k iterations, and we report results from the checkpoints with the best chrF++ scores.
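For reference, the ablation configuration can be restated as a config dict with a rough parameter estimate. The estimate counts only attention and FFN weight matrices (embeddings, biases, and LayerNorms are ignored), so it is an approximation, not the exact count.

```python
# The dense ablation model configuration, restated as a config dict,
# with an approximate parameter count over the Transformer body only.
config = {
    "model_dim": 2048,
    "ffn_dim": 8192,
    "attention_heads": 16,
    "encoder_layers": 24,
    "decoder_layers": 24,
}

def approx_params(cfg):
    d, f = cfg["model_dim"], cfg["ffn_dim"]
    attn = 4 * d * d                 # Q, K, V, and output projections
    ffn = 2 * d * f                  # the two FFN projections
    enc = cfg["encoder_layers"] * (attn + ffn)
    # Decoder layers add a cross-attention sublayer.
    dec = cfg["decoder_layers"] * (2 * attn + ffn)
    return enc + dec
```

This yields roughly 2.8B parameters for the Transformer body; a large shared vocabulary adds several hundred million embedding parameters, consistent with the stated 3.3B total.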

Results
In Figure 6, we show the impact of adding different data sources over Primary data. We aggregated results over language pair type and resource level. We observe that across all language pairs, performance improves significantly by adding Mined data and further by adding MmtBT+SmtBT back-translated data. Focusing on resource levels, we observe that low-resource languages improve more than high-resource languages. This is not surprising, as high-resource languages already have significant amounts of Primary bitext data publicly available.

Impact of mining and back-translation on very low-resource languages
Looking deeper at the results, we investigated how mined and back-translated data sources impact very low-resource languages. We define very low-resource languages as those with fewer than 100K unique sentence pairs across all language pairings available in public bitext corpora; 84 languages meet this criterion. On aggregate, our proposed techniques of mining and back-translation improved low-resource and very low-resource language directions significantly (see Figure 6). Most prominently, very low-resource into-English directions improved by +12.5 chrF++ with mined data and +6.1 chrF++ with additional back-translation data, for an overall improvement of +18.6 chrF++. Similarly, we observe that out-of-English directions improve by +4.7 chrF++ when adding mined data and +1.9 chrF++ when adding back-translated data, for an overall improvement of +6.6 chrF++. For non-English-centric pairs, we see an improvement of +7.5 chrF++ when adding mined data and +1.4 chrF++ when adding back-translated data, for an overall improvement of +8.9 chrF++. These results show that our improvements in bitext mining and back-translation increase data quantity and quality for low-resource languages often underserved or excluded by existing translation systems.
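The quoted gains are cumulative: for each direction type, the mined-data and back-translation improvements sum to the overall figure. A quick consistency check:

```python
# Reported chrF++ gains for very low-resource directions, split into the
# contribution of mined data and of back-translated data, plus the total.
gains = {
    "very low-res into eng": (12.5, 6.1, 18.6),
    "eng into very low-res": (4.7, 1.9, 6.6),
    "non-English-centric":   (7.5, 1.4, 8.9),
}

# Each overall improvement is the sum of the two incremental gains.
for direction, (mined, bt, total) in gains.items():
    assert abs(mined + bt - total) < 1e-9, direction
```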

D.2 The 200 language dataset
Combining multiple sources of data, our final data set covers 200 languages. The data set comprises primary bitext for 661 language pairs, mined bitext for 784 language pairs, and 261 directions of back-translated bitext. In total, there are 1220 language pairs, or 2440 directions (xx-yy and yy-xx), for training. These 2440 directions result in over 18B total sentence pairs. Figure 7 displays the distribution of samples across the 1220 language pairs: the majority of pairs have fewer than 1M sentences and are low-resource directions.

E.1 Technical details
Both the encoder and decoder are stacks of Transformer layers. Each Transformer layer takes a sequence of embeddings as input and outputs a sequence of embeddings. In the encoder, Transformer layers are composed of two sub-layers, a self-attention layer and a feed-forward layer. These are applied sequentially and are both preceded by a LayerNorm [96] and followed by a residual connection [97]:

x ← x + SelfAttention(LayerNorm(x)),
x ← x + FFN(LayerNorm(x)).

We applied LayerNorm at the beginning of each sub-layer (Pre-LN) instead of applying LayerNorm after the residual connection at the end of each sub-layer (Post-LN), because Pre-LN is more stable in practice than Post-LN [98]. The self-attention layer is an attention layer that updates each element of the sequence by looking at the other elements, while the feed-forward layer (FFN) passes each element of the sequence independently through a 2-layer MLP. In the decoder, there is an additional third sub-layer between the self-attention and the feed-forward layer, which computes attention over the encoder output. We refer the reader to [63] for further details.
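A minimal NumPy sketch of the Pre-LN residual pattern described above; a toy 2-layer FFN stands in for the sublayer, and the dimensions are illustrative, not the model's.

```python
import numpy as np

# Pre-LN residual pattern: each sublayer computes x + Sublayer(LayerNorm(x)),
# i.e. the norm is applied before the sublayer rather than after the
# residual connection (Post-LN).
def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_residual(x, sublayer):
    return x + sublayer(layer_norm(x))

# Toy 2-layer MLP standing in for the feed-forward sublayer.
rng = np.random.default_rng(0)
d, f = 8, 32
w1 = rng.normal(size=(d, f)) / np.sqrt(d)
w2 = rng.normal(size=(f, d)) / np.sqrt(f)
ffn = lambda h: np.maximum(h @ w1, 0.0) @ w2  # ReLU MLP

x = rng.normal(size=(5, d))       # a sequence of 5 embeddings
out = pre_ln_residual(x, ffn)     # same shape as the input
```

An encoder layer chains two such wrappers, one around self-attention and one around the FFN; the decoder inserts a third around cross-attention.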

Sparsely gated mixture of experts
As illustrated in Figure 8, we replaced the FFN sublayer in dense models with an MoE sublayer once every f_MoE layers in both the encoder and decoder. The MoE sublayer consists of E feed-forward networks (FFN_1, FFN_2, ..., FFN_E), each with its own input and output projections. A gating network, consisting of a softmax-normalized linear layer with weights W_g, is attached to each MoE sublayer to decide how to route tokens to experts. Given an input token x_t, the output of the MoE sublayer is evaluated as:

MoE(x_t) = Σ_{e=1}^{E} G_{t,e} · FFN_e(x_t),

with G_t ∈ R^E the routing vector computed by the gating network, i.e., for each expert, G_{t,e} is the contribution of the e-th expert (FFN_e) to the MoE output. We followed the Top-k-Gating algorithm of [18] and dispatched each token to at most k = 2 experts. We always chose the top two scoring experts per token and did not add randomization to the choice of the second expert.
The Transformer encoder-decoder model, supplemented with MoE layers and their respective gating networks, learns to route input tokens to the corresponding top-two experts by optimizing a linearly weighted combination of label-smoothed cross-entropy [39] and an auxiliary load balancing loss [20]. This additional loss term (LB) pushes the tokens to be uniformly distributed across experts and is evaluated as:

LB = E · Σ_{e=1}^{E} f_e · p_e,

where f_e is the fraction of tokens routed to the e-th expert, as their first choice, through Top-k-Gating, and p_e is the average routing probability to that expert over the T tokens in the mini-batch. We refer the reader to [18] for more on the optimization of MoE models.
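The gating and load-balancing computations can be sketched as follows. The token and expert counts are toy values, and the scaling of the auxiliary loss follows a common formulation of [18, 20]; treat the constants as assumptions.

```python
import numpy as np

# Sketch of Top-2 gating with a load-balancing term: G_t = softmax(W_g x_t),
# each token is routed to its two highest-scoring experts, and
# LB = E * sum_e f_e * p_e pushes routing toward a uniform distribution.
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top2_gate(x, w_g):
    """x: (T, d) token embeddings; w_g: (d, E) gating weights."""
    gates = softmax(x @ w_g)                      # (T, E) routing probs
    top2 = np.argsort(-gates, axis=-1)[:, :2]     # two best experts/token
    T, E = gates.shape
    f = np.bincount(top2[:, 0], minlength=E) / T  # first-choice fractions
    p = gates.mean(axis=0)                        # mean routing probability
    lb_loss = E * np.sum(f * p)                   # load-balancing term
    return top2, gates, lb_loss

rng = np.random.default_rng(0)
x, w_g = rng.normal(size=(16, 8)), rng.normal(size=(8, 4))
experts, gates, lb = top2_gate(x, w_g)
```

In training, `lb` is added to the label-smoothed cross-entropy with a linear weight; each token's output is then the gate-weighted sum of its two selected experts' FFN outputs, as in the MoE equation above.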

E.3 Finetuning NLLB-200
Our goal in the next set of experiments is to examine whether we are developing a robust general-purpose MT system capable of translating in various domains. For this purpose, we study whether NLLB-200 can effectively transfer to other domains and whether it lends itself to the common strategy of single-task finetuning with small quantities of in-domain, high-quality translations [99-102].
Experimental Setup.
We experimented with the NLLB-MD dataset (see Appendix J). It provides high-quality translations in four domains: news, scripted formal speech (scripted), unscripted informal speech (chat), and health. Language-wise, it includes translations from English to six languages (five of which are low-resource). We held out 500 sentences in each language for testing, finetuned on 2000 sentences, and used the remainder for validation. In each translation direction (into and out of English), we finetuned NLLB-200 on that single task for 50 updates (15-20 epochs) with a learning rate of 5e-5, following an inverse square-root schedule after warming up for ten updates. We considered two options for finetuning NLLB-200 on the new task: (1) finetuning with the original training objective (label-smoothed cross-entropy with an additional load balancing regularization term) and (2) finetuning without regularization, thus leaving the MoE's load distribution unconstrained.
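The finetuning schedule described above (linear warmup to 5e-5 over ten updates, then inverse square-root decay) can be sketched as:

```python
# Learning-rate schedule for finetuning: linear warmup to the peak rate
# over the first `warmup` updates, then inverse square-root decay.
def lr_at(step, peak_lr=5e-5, warmup=10):
    if step < warmup:
        return peak_lr * (step + 1) / warmup       # linear warmup
    return peak_lr * (warmup / (step + 1)) ** 0.5  # inverse sqrt decay

schedule = [lr_at(s) for s in range(50)]  # the 50 finetuning updates
```

The rate peaks at 5e-5 on the last warmup update and decays smoothly for the remaining 40 updates.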

Results.
Figure 9 shows validation chrF++ scores in the chat domain tasks for the pre-trained NLLB-200, the finetuned model with load balancing (NLLB-200+FN+LB), and the finetuned model without load balancing (NLLB-200+FN). On average, finetuning (FN+LB) improves accuracy by +6.1 chrF++ points. The performance gain is more considerable when translating into high-resource languages (eng and rus), with an average of +8.9 chrF++ points. When translating into the five low-resource languages in NLLB-MD, the gain is +2.0 chrF++ points. When switching off the load balancing regularization, NLLB-200+FN improves by +7.2 chrF++ points; particularly noteworthy is translation into low-resource languages, which improves by +3.7 points. We next finetuned with our best strategy (NLLB-200+FN) on the other three domains of NLLB-MD and report chrF++ scores on the test sets in Figure 10. On average, by finetuning NLLB-200, we improved translation accuracy in the new domains by +7.7 in chat, +3.1 in news, +4.1 in health, and +5.8 in scripted (all in terms of chrF++). These results are evidence of NLLB-200's transferability and adaptability to other domains.
The issue of finetuning sparsely activated large models has been raised in prior work [21, 103, 104]. These large models are more prone to overfitting than their dense counterparts and, in some cases, perform poorly when finetuned [103, 104]. Fedus et al. [104] suggest increasing regularization with expert dropout, effectively applying stronger regularization to the expert parameters, while Zoph et al. [21] combat overfitting by updating only a subset of model parameters. With MoE Expert Output Masking (EOM), NLLB-200 is heavily regularized and exhibits less overfitting on downstream tasks. We hypothesize that without load balancing, we allow the model to drop experts, practically activating a few that will be finetuned for the downstream task. This is particularly relevant when finetuning on a single task for which NLLB-200 has learned to assign specific experts (see Section 8.5 of [34]); adding a load balancing loss when the mini-batches are not mixed would considerably shift this learned assignment. We leave the exploration of MoE finetuning strategies with added regularization, selective finetuning, and relaxed optimization for future work.

F Human evaluation details
Annotators. All evaluators were professional translators. Beyond this qualification, the standard requirements were: 3+ years of translation experience in the language pair; native speaker fluency in the target language; and a high level of English (C1-C2).

XSTS
We adapted the recently proposed XSTS methodology from Agirre et al. [49]. In short, XSTS is a human evaluation protocol that prioritizes meaning preservation over fluency. For low-resource languages, translations are usually of poorer quality, and so we focus on usable (i.e., meaning-preserving) translations, even if they are not fully fluent. Compared to Direct Assessment [72] with a 5-point scale (the original Direct Assessment uses a 100-point scale), XSTS has been found to yield higher inter-annotator agreement [48].
XSTS rates each source sentence and its machine translation on a five-point scale, where 1 is the lowest score and 5 is the highest. Each point on the scale is defined as follows:

1. The two sentences are not equivalent, share few details, and may be about different topics. If the two sentences are about similar topics, but less than half of the core concepts mentioned are the same, 1 is still the appropriate score.
2. The two sentences share some details but are not equivalent. Some important information related to the primary subject/verb/object differs or is missing, which alters the intent or meaning of the sentence.
3. The two sentences are mostly equivalent, but some unimportant details can differ. There cannot be any significant conflicts in intent or meaning between the sentences, no matter how long the sentences are.
4. The two sentences are paraphrases of each other. Their meanings are near-equivalent, with no major differences or missing information. There can only be minor differences in meaning due to differences in expression (e.g., formality level, style, emphasis, potential implication, idioms, common metaphors).
5. The two sentences are precisely and completely equivalent in meaning and usage expression (e.g., formality level, style, emphasis, potential implication, idioms, common metaphors).

Further details on calibration are reported in Section 7.2 of [34].

G Limitations
In the previous sections, we documented how several data, modeling, and evaluation challenges were overcome to realize NLLB-200. In this section, we underline some limitations of our effort.

Bitext mining for low-resource languages
For some languages, we could only create a small amount of bitext through data mining. The main limiting factor is the paucity of monolingual data: many low-resource languages have a limited web presence, and even though the data we curated was processed across many stages (i.e., language identification, aggressive cleaning of monolingual data, etc.), the amount of training data for different languages remained unbalanced. An important final consideration is that the web is saturated with machine-translated content; for example, many websites may use translation to localize their content. On the upside, most of the languages we targeted in NLLB-200 are not supported by most existing commercial translation services. However, in the process of mining higher-resource languages, it is likely that our mined data sets contain pre-translated content.
We also want to reflect on the issue of data ownership. In an interview study we conducted with low-resource language speakers, many participants expressed that sharing language access might, in fact, be a necessary trade-off for technological advancement: blocking such access means blocking any future benefits that could positively impact low-resource language communities. However, we stress that access and ownership are two disparate concepts. Even though we deploy many low-resource language data sets, ownership ultimately belongs to the speakers of these languages.
Pairing self-supervised learning with machine translation

Recent work [75, 76, 105] demonstrates that denoising and similar self-supervised objectives are very useful for improving model performance when trained concomitantly with machine translation tasks in a multitask setup. In NLLB-200, we tried two self-supervised learning (SSL) objectives and experimented with different combinations of both alongside the MMT task. We observe that only the denoising autoencoder (DAE) objective performs well when trained with MMT. The benefits of the LM task in a multitask setup with MMT are not well-studied, and future work could reveal a deeper understanding of the mechanisms behind this finding.
Deploying translation models for specific domains or language families

Practically deploying machine learning models is technically challenging and remains an active area of research. Our investigation indicates that distillation is a promising avenue for leveraging multilingual models and adapting them to a subset of desired language directions and domains. This allowed the Wikipedia translation model trained in NLLB Team et al. [34] to perform better than much larger models. In the same paper, we also demonstrated multidialectal translation capabilities by translating from and into different Arabic languoids. We found that while a massive multilingual model achieves the best average score, a smaller specialized model outperforms it in specific directions. This highlights the importance of more focused research on closely related languages.

Curating benchmark datasets for low-resource languages
Compared to creating FLORES-101, our new translation workflow substantially streamlined the process of realizing FLORES-200. For example, the number of languages requiring re-translation in FLORES-200 was ten, down from 45 in its predecessor. Despite these improvements, we continued to experience difficulties similar to those of FLORES-101, at an even greater scale due to the increasingly low-resource nature of the supported languages. Moreover, industry-wide standards for dealing with these lower-resource languages are limited, leading to more logistical barriers for us to navigate [84]. This led to longer turnaround times, occasionally forced by the need to find new translators and reviewers. In the cases of Sicilian and Buginese, work took significantly longer than for other languages to complete (287 days).
XSTS for human evaluation

XSTS scoring followed by calibration successfully addresses the issue of evaluation consistency across evaluators and language directions in a massively multilingual context. However, as this metric focuses on meaning preservation rather than fluency, it may face difficulties when used to evaluate the quality of translations across coexisting language registers.

Added toxicity detection
Detecting added toxicity remains challenging, especially when detection must be done at scale for 200 languages. Since we evaluated our approach on a translated data set, the quality of translations may be a confounding factor worth exploring. For example, the quality of toxicity detection can be affected by the amount of resources available per language. Alternatively, the quality and efficiency of our detectors, which locate or filter toxicity, may vary depending on list-building inconsistencies, list length, segmentation accuracy, the degree of complexity in morphological variation, and the amount of non-lexicalized toxicity. The expansion and disambiguation of small toxicity lists are critical areas for future work, which will likely require close collaboration with a larger number of native speakers. A first step towards disambiguation can be contextualizing polysemous words by replacing single tokens with n-grams that have a much higher probability of representing true toxic content. Finally, we know that added toxicity can be caused by phenomena that would be considered instances of hallucination. Our visualization examples with ALTI+, which show a low amount of source contribution for toxicity computed with this method, are a strong indicator of hallucination. Additional work aiming to further quantify and mitigate added toxicity is already in progress [106].

• Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general domain text data and is not intended to be used with domain-specific texts, such as medical or legal documents. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.

Metrics
• Model performance measures: The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics, which are widely adopted by the machine translation community. Additionally, we performed human evaluations with the XSTS protocol and measured the toxicity of the generated translations.

Evaluation Data
• Datasets: The Flores-200 dataset is described in section 4 of the paper.
• Motivation: We used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200.
• Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.

Training Data
• We used parallel multilingual data from various sources to train the model. We provide a detailed report on the data selection and construction process in section 2 of the paper. We also used monolingual data constructed from Common Crawl; more details are provided in section 5.2 of the paper.

Ethical Considerations
• In this work, we took a reflexive approach to technological development to ensure that we prioritize human users and minimize risks that could be transferred to them. While we reflect on our ethical considerations throughout the article, here are some additional points to highlight. For one, many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many of these communities, such access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. The latter scenarios could arise if bad actors misappropriate our work for nefarious activities, which we conceive of as an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although we invested heavily in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although we did our best to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have an adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).

Caveats and Recommendations
• Our model has been tested on the Wikimedia domain, with limited investigation of the other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model does not capture. Users should make appropriate assessments.

Carbon Footprint Details
• The carbon dioxide (CO2e) estimate is reported in section 8.8 of the paper.
For this card, we used the template from [107]. The NLLB-Seed data is a collection of human-translated data sampled from Wikimedia's List of articles every Wikipedia should have, a collection of 10,000 Wikidata IDs corresponding to notable topics in different fields of knowledge and human activity. It contains bitext from English to 43 languages in 6193 sentences. The motivation for this data was to provide a starter set of clean data on a variety of topics in those languages.

• How to use the data
You can access links to the data in the README at https://github.com/facebookresearch/fairseq/tree/nllb

• Supported Tasks and Leaderboards
The NLLB model uses this data to boost the performance of low-resource languages.

• Languages
NLLB-Seed contains 43 language pairs with English.

Dataset Creation • Curation Rationale
Script, dialect, spelling, and translation approaches were first established and aligned against Flores-200. Translators referenced these linguistic alignments while working on NLLB-Seed translations. The data sets were translated directly from English for 39 languages; half of the data for Ligurian (3000 sentences) was first translated from English to Italian and then from Italian to Ligurian, while the other half was translated directly from English; and three Arabic script languages (Acehnese, Banjar, Tamasheq) were transliterated from their respective Latin script datasets that were translated from English. The translation or transliteration phase was followed by a linguistic quality assessment phase in which the completed data sets were checked against the linguistic alignments from Flores-200, along with basic quality sanity checks. The data sets were then finalized.

• Source Data
Source Data includes 6193 English sentences sampled from Wikipedia Articles in 11 categories: Anthropology, Arts, Biology, Geography, History, Mathematics, People, Philosophy, Physical, Society, Technology.

• Annotations
There are no extra annotations with the bitext.

Considerations for Using the Data
• Social Impact of Dataset
The dataset is specifically built to increase translation quality and improve language identification for the extremely low-resource languages it contains. This helps improve the quality of these languages in machine translation systems.

• Discussion of Biases
Biases on the dataset have not been studied.

Additional Information
• Dataset Curators All translators who participated in the NLLB-Seed data creation underwent a vetting process by our translation vendor partners.
Translators are required to be native speakers of and educated in the target language. They must also have a high level of fluency (C1-C2) in English. Non-English translators are required to have a high level of fluency in their source language. Translators were also required to have at least two to three years of translation experience in the relevant language pair if they hold an academic degree in translation or linguistics, and three to five years of translation experience if they do not have any relevant academic qualification. Translators also undergo a translation test every 18 months to assess the quality of their work.

Fig. 6: Comparing model performance when trained on data from various sources. We observe significant improvements from adding mined and back-translated data for all types of language pairs and resource levels.

Fig. 7: Distribution of the amount of training sentence pairs across the 1220 language pairs in our dataset. We observe that the majority of pairs have fewer than 1M sentences and are low-resource.

Fig. 8: Illustration of a Transformer encoder with MoE layers inserted at a frequency of 1/f_MoE. Each MoE layer has E experts and a gating network responsible for dispatching tokens.

Fig. 9: Comparison of NLLB-200 with and without finetuning on the 12 English-centric tasks of NLLB-MD. NLLB-200+FN+LB and +FN refer to finetuning with and without load balancing (LB). We report accuracy in terms of chrF++ on the validation set.

Fig. 10: chrF++ scores on the NLLB-MD test sets after finetuning NLLB-200 (+FN) on the news, health, and scripted domains.
Card - NLLB-200 Model Details

• Person or organization developing model: Developed by Meta AI Research
• Model date: June 30th, 2022
• Model version: NLLB-200
• Model type: Transformer Mixture-of-Experts machine translation model.
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: The exact training algorithm, data, and the strategies to handle data imbalances for high- and low-resource languages that were used to train NLLB-200 are described in the paper (NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022).
- License: CC-BY-NC
- Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues

Intended Use
• Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository, along with the training code and references to evaluation and training data.
• Primary intended users: Primary users are researchers and the machine translation research community.

Table 4: No Language Left Behind languages: we display the language code, language name, script, and language family. A check mark indicates machine translation support by Google and/or Microsoft (as of July 2022), whereas ✗ indicates support by neither. Res. indicates whether we classify the language as high- or low-resource. Specification contains, if available, additional information on the language variant collected in Flores-200. The superscript new indicates new languages added to Flores-200 compared to Flores-101.

Table 5: Summary of some of the main datasets used in training NLLB-200. Direction counts do not include reverse directions.