Introduction

In the recent literature, large language models (LLMs) have demonstrated great promise in healthcare. For example, closed-source models such as GPT-41 and MedPaLM 22 have shown remarkable performance and successfully passed the United States Medical Licensing Examination (USMLE). Concurrently, open-source models like Llama 2 have also facilitated the development of specialized language models for medicine, such as MEDITRON, PMC-LLaMA, MedAlpaca, and ChatDoctor3,4,5,6, gradually bridging the performance gap with their closed-source peers. Despite these advancements, the primary focus of these sophisticated medical language models on English-language applications has constrained their potential reach, limiting the benefits for a wider, linguistically diverse audience.

In the realm of open-source multilingual large language models (LLMs), exemplified by BLOOM7 and the more recent InternLM 28, a notable challenge persists despite their training on diverse multilingual corpora: they exhibit unsatisfactory performance on medical queries in non-English languages, a discrepancy primarily attributed to the under-representation of medical content in these general datasets. This paper endeavors to bridge this gap by developing an open-source, multilingual language model for healthcare. As shown in Fig. 1, our contribution is threefold: firstly, we gather a multilingual medical corpus designed for auto-regressive training, aiming to lay a robust foundation that accurately reflects the linguistic diversity and complexity of the medical domain; secondly, to monitor progress, we introduce a new comprehensive multilingual medical question-answering (QA) benchmark, enabling the evaluation of multi-choice QA and rationale ability of different language models under both zero-shot and fine-tuning settings; lastly, we test a wide spectrum of existing language models, together with those that have undergone auto-regressive pre-training on our corpus. Through this comprehensive evaluation, we aim to provide valuable insights into the models’ capabilities and foster a deeper understanding of the intricacies involved in multilingual medical query processing.

Fig. 1: Overview of our contributions.

a The figure demonstrates our proposed large-scale multilingual medical corpus (MMedC), containing 25.5B tokens, covering six main languages and collected from four data sources. b The figure shows the composition of our comprehensive multilingual medical benchmark (MMedBench), which is constructed by aggregating medical QA cases in different languages and prompting GPT-4 to provide rationale sentences. MMedBench enables the evaluation of both multi-choice accuracy and the ability of rationale generation for different LLMs under zero-shot or fine-tuning settings. c The line plot shows the final multi-choice accuracy of various LLMs on our MMedBench, where our final model MMed-Llama 3 demonstrates the best performance among all existing open-source LLMs. d The comparison bar further details the gains in both multi-choice accuracy and rationale generation ability when comparing MMedLM 2 to InternLM 2, or MMed-Llama 3 to Llama 3. Considering that the main difference between our models and their base models lies in the auto-regressive training on MMedC, this comparison highlights the importance of our contributed medical-specific multilingual language corpus. Source data are provided as a Source Data file.

For auto-regressive training, we have developed a large-scale Multilingual Medical Corpus (MMedC), amassing over 25.5 billion medical-related tokens across six primary languages: English, Chinese, Japanese, French, Russian, and Spanish. This diverse dataset was compiled from four distinct sources: (i) we devised an automatic pipeline to filter medical-related content from a broad multilingual corpus, ensuring a focused and relevant dataset; (ii) we curated an extensive collection of medical textbooks in various languages and converted them into text with carefully designed pre-processing, e.g., Optical Character Recognition (OCR) and heuristic data filtering. We will share the name list of the books together with the methodologies and tools for curation; (iii) to guarantee wide-ranging encapsulation of medical knowledge, we incorporated texts from open-source medical websites, enriching our corpus with authoritative and comprehensive medical information; (iv) we integrated a number of existing small-scale medical corpus datasets, further enhancing the breadth and depth of our corpus. To our knowledge, MMedC represents the first endeavor to construct a corpus specifically focused on the multilingual medical domain.

As for benchmark curation, we start by aggregating existing medical multiple-choice QA datasets across the same six languages as MMedC. We further augment them with rationale content generated by GPT-4, enriching the datasets with explanations that support the correct answers. Consequently, our enriched dataset encompasses 53,566 QA pairs across six languages, uniquely offering both multi-choice QA and accompanying rationale reasoning. This extensive collection, termed the Multilingual Medical Benchmark (MMedBench), spans 21 medical fields, including but not limited to Internal Medicine, Biochemistry, Pharmacology, and Psychiatry. We divide it into 45,048 training pairs and 8518 testing pairs. The training split enables fine-tuning of LLMs after domain-specific continued pre-training. We utilize the entire test set, comprising 8518 QA pairs, to evaluate the accuracy of multi-choice question answering. To further examine the models’ reasoning ability, we select a subset of 1136 QA pairs, each accompanied by manually verified rationale sentences, serving as a more specialized benchmark for reasoning evaluation.

At the evaluation phase, we conducted comprehensive benchmarking across fourteen existing LLMs with multilingual support, including GPT-3.5, GPT-4, Gemini-1.0 pro, BLOOM, InternLM, InternLM 2, MedAlpaca, ChatDoctor, PMC-LLaMA, Mistral, BioMistral, MEDITRON, Llama 2 and Llama 3, alongside the LLMs further trained on MMedC. These models were evaluated across three different settings: zero-shot, parameter-efficient fine-tuning (PEFT), and full fine-tuning. Given the complexity of evaluating rationale quality, which demands an assessment of long-sentence semantic integrity, we also incorporate human rating scores in addition to mainstream automated metrics. This dual approach not only provides a comprehensive measure of each model’s performance, but also enables us to scrutinize the correlation between automated metrics and human judgment. Through this analysis, we identify the most reliable metric for extended comparisons, thereby enriching the methodology for evaluating reasoning ability in large language models.

In our experiments, models that underwent further auto-regressive training on the MMedC consistently demonstrate enhanced performance, thereby underscoring the value and effectiveness of our compiled multilingual corpus. Our final model MMed-Llama 3 demonstrates the best performance on both multilingual and English-only benchmarks. We will publicly release our dataset (except for the license-restricted books for which we will provide a name list), codebase, and trained models to foster future research. In addition, we recognize the significance of robust evaluation metrics, particularly for the generation of medical texts that often involve complex, long sentences. To this end, we will also release detailed human rating results for individual cases.

Results

Here, we start by presenting the statistics of our constructed datasets. Then, we evaluate various LLMs on MMedBench on multi-choice questions and rationale ability, as well as verifying the effectiveness of MMedC. Lastly, we conduct a series of ablation studies to investigate the impact of each dataset component.

Data statistics

We present the detailed statistics on the two proposed datasets, namely, MMedC, the most extensive Multilingual Medical Corpus to date, and MMedBench, a new multilingual medical benchmark.

We first present the Multilingual Medical Corpus (MMedC), which contains over 25.5B tokens acquired mainly from four sources, i.e., medical-related content filtered from a general large-scale multilingual corpus, medical textbooks, medical websites, and existing small-scale corpora. The main statistics are plotted in Fig. 2.

Fig. 2: Statistic results on MMedC.

a The distribution of languages included in MMedC around the world (this map is for demonstration only and has nothing to do with politics). The map shows that our collected corpora cover most main countries worldwide. b The token distribution for each language. The bar plot shows the detailed token number for each language. c The contributions of the four sources to the six languages in MMedC. The Sankey diagram shows how the four considered data sources, i.e., filtered content, medical textbooks, medical websites, and small-scale corpora, contribute to the different languages. Source data are provided as a Source Data file.

In detail, our analysis begins with the composition of the Multilingual Medical Corpus (MMedC), which incorporates six languages that collectively cover a significant portion of the global population. This diversity ensures our model’s broad applicability across various linguistic contexts, as illustrated in Sub-figure (a). Subsequently, Sub-figure (b) presents a detailed breakdown of the token distribution across these languages. Notably, English constitutes the largest segment at 42%, while Russian represents the smallest at just 7%. However, it is important to highlight that, given the corpus’s overall volume of 25.5 billion tokens, even the smallest share translates into a substantial amount of text, approximately 2 billion tokens. Lastly, Sub-figure (c) delineates the contribution of the four distinct sources to our dataset across different languages. Predominantly, medically related content filtered from broader datasets forms the bulk of contributions for most languages, supplemented by data from medical textbooks, medical websites, and pre-existing small-scale corpora. The variety of sources ensures richness of medical knowledge, ranging from everyday medical information to more specialized knowledge found in textbooks and encyclopedias. This detailed examination of our data sources sheds light on the nuanced composition of MMedC, offering insights into its diverse and comprehensive nature.

Then, to better evaluate the performance of multilingual medical models, we further present a comprehensive multilingual medical Question and Answering Benchmark (MMedBench). We start by delving into its core attributes, including the total number of training and testing cases, the distribution of answer options, and the average length of question and answer tokens. Figure 3a illustrates these fundamental characteristics, highlighting that MMedBench often includes questions with multiple correct options, which adds complexity for models to navigate. Additionally, the answers contain rationale sections averaging 200 tokens each. This substantial token count serves two purposes: it helps train language models by exposing them to extended reasoning passages, and it supports evaluating their ability to generate and understand lengthy, complex reasoning statements.

Fig. 3: Statistic results on MMedBench.

a The bar plot shows the basic statistics of the train and test sets of MMedBench. The term “Avg. tokens” represents the mean token length per sample for the various components. “Rationale” denotes the rationale sentences in the answer, “Option” denotes the option descriptions in the choice list, and “Question” denotes the question sentences. The term “Prop. of multi-option” denotes the proportion of questions with multiple correct options, and “Prop. of single-option” denotes the proportion with a single correct option. The final term “Number of QA pairs” denotes how many QA pairs are in the train or test split. b The histogram shows the topic distribution in the test split of MMedBench, covering a wide range of medical aspects, from general and specialized medicine to basic medical sciences. This allows MMedBench to comprehensively measure the performance of medical models. Source data are provided as a Source Data file.

In our detailed exploration of MMedBench, we categorize each question into one of the 21 medical topic categories using GPT-4. These categories include Internal Medicine, Biochemistry, Pharmacology, Psychiatry, Microbiology, Physiology, Pathology, Immunology, Obstetrics and Gynecology, Public Health, Hematology, Surgery, Emergency Medicine, Orthopedics, Neurology, Anatomy, Medical Genetics, Radiology, Dermatology, and Endocrinology. This categorization has been rigorously verified by at least two clinicians to ensure its comprehensiveness and its coverage of the breadth of medical disciplines.

Figure 3b showcases the diversity of our multilingual benchmark, spanning a wide array of medical questions from foundational clinical medicine to specialized areas such as pharmacology and public health, with a pronounced emphasis on areas like Internal Medicine and Biochemistry. This underlines the benchmark’s ability to assess how well models recognize and process a broad spectrum of medical inquiries.

Evaluation on MMedBench

In this section, we present a comprehensive benchmark of the foremost LLMs using our MMedBench under zero-shot, PEFT, and full fine-tuning settings. Our evaluation focuses on two aspects of model performance: accuracy on multiple-choice questions and the ability to generate rationales. The evaluated LLMs can be categorized into four distinct classes, i.e., closed-source LLMs, popular open-source LLMs, medical-specific open-source LLMs, and those that have undergone further training on our MMedC. The latter three can all be categorized as open-source LLMs.

Initially, our analysis focuses on the state-of-the-art, proprietary closed-source LLMs developed by OpenAI and Google, specifically GPT-3.5, GPT-4, and Gemini-1.0 pro. These models are examined through their publicly available online APIs solely in a zero-shot setting, as they are not accessible for any further training. However, note that, as the training data for these closed-source models are confidential, it is difficult to judge whether they are really “zero-shot”. Following this, our evaluation encompasses a range of open-source LLMs such as Mistral, InternLM 2 and Llama 3. We observe that the zero-shot responses from these open-source LLMs are relatively poor, making it difficult to draw effective comparisons (see Supplementary Material F for more zero-shot failure cases). We therefore compare them in fine-tuning settings (PEFT and full fine-tuning). Among these, we make a further distinction between general LLMs and those specifically tailored for the medical domain. Finally, we evaluate models that have undergone further training on our proposed corpus, named MMedLM (based on InternLM), MMedLM 2 (based on InternLM 2) and MMed-Llama 3 (based on Llama 3). These models are uniquely augmented with domain-specific knowledge by auto-regressive training on MMedC.

We first evaluate models on multilingual multiple-choice question answering. As illustrated in Table 1, medical-specific Large Language Models (LLMs) generally exhibit high accuracy scores in English, yet their performance declines significantly in languages other than English. Notably, the fine-tuned PMC-LLaMA achieved an English accuracy score of 47.53; despite outperforming its contemporaneous counterparts, it still falls significantly behind the GPT models. Later, with the deployment of more advanced foundational models, open-source models started to bridge the gap with the GPT series; for instance, Mistral, InternLM 2 and Llama 3, once fine-tuned on the training set of MMedBench, recorded average accuracy scores of 60.73, 58.59 and 62.79, respectively, surpassing all predecessors of comparable scale. Enhanced performance is also observed after additional auto-regressive training on our own MMedC dataset. Specifically, our final model, MMed-Llama 3, demonstrated significant improvements over its counterpart without further training on MMedC, for example, 67.75 (MMed-Llama 3) vs. 62.79 (Llama 3) under full fine-tuning evaluation. A similar observation also holds for the PEFT setting, i.e., later LLMs perform better and training on MMedC brings a significant gain. As a result, MMed-Llama 3 stands as the most competitive open-source model at the 8B scale, approaching GPT-4’s accuracy of 74.27.

Table 1 Multiple-choice accuracy evaluation on MMedBench

In addition to multiple-choice QA tasks, our study extends to examining the rationale ability of various LLMs. To facilitate this comparison, we employ several automatic metrics, namely BLEU9 and ROUGE10, which assess sentence similarity based on n-grams. Furthermore, we explore the use of BERT-score11, a metric that uses a pre-trained BERT model to extract high-level semantic features and employs cosine similarity for semantic evaluation.

We offer detailed instructions that prompt the model to outline its analytical process for delivering the final answer, enabling a clear assessment of its reasoning capabilities. The performance is then meticulously evaluated using a variety of metrics. Specifically, the ROUGE-1 and BLEU-1 scores are presented in Table 2. Additionally, results for other metrics are detailed in Supplementary Material E providing a comprehensive view of the model’s performance across diverse evaluation frameworks.

Table 2 Rationale evaluation on MMedBench with ROUGE-1/BLEU-1

Given the limitations of automatic metrics in evaluating free-text generation, we further employ relative human ratings to rank performance and identify the most reliable automatic metric for future in-depth evaluations.

Specifically, from the test set of MMedBench, we randomly selected 50 test cases per language, alongside outcomes generated by six notable models: MMed-Llama 3 (ours), Llama 3, InternLM 2, BioMistral, MEDITRON, and GPT-3.5. The sequence of samples and corresponding model outputs was randomized to prevent bias. The evaluation panel, comprising five post-graduate students from the medical school of Shanghai Jiao Tong University and Peking Union Medical College, was instructed to rank the outputs based on accuracy, reasoning ability, and internal knowledge. To facilitate accurate assessment, we also provided manually verified references. Rankings were quantitatively assigned, with the highest rank awarded a score of 6 and the lowest a score of 1, thereby quantifying the quality of each model’s output. In parallel, we leveraged GPT-4 as an additional evaluator, assigning it the role of a judge to rank the outputs. Further details on GPT-4’s evaluation method are available in Supplementary Material A.

Figure 4a illustrates the comparative analysis of model performances through relative ratings. Notably, MMed-Llama 3 achieved the highest scores in both human (4.10) and GPT-4 (4.73) evaluations, aligning with its superior performance as indicated by the automatic metrics. It is particularly worth highlighting that MMed-Llama 3 significantly outperforms the other models in the GPT-4 rating, surpassing the second-best model, InternLM 2, by 0.89 rating points. Interestingly, GPT-3.5 received a lower human rating of 2.37, suggesting that the evaluators’ preferences might be influenced by the brevity of the responses. Comprehensive rating results for each language and model are detailed in Supplementary Material E.

In addition to comparing different LLMs, our study delves into the correlation between various automatic evaluation metrics and human preferences. This correlation analysis enables us to identify the most effective automatic metric for benchmarking purposes, thereby potentially eliminating the need for resource-intensive human evaluations in future research. We employ the Kendall rank correlation coefficient to measure the agreement between the rankings of each model’s generated rationales by automatic metrics and by human evaluations. The findings, illustrated in Figure 4b, indicate that GPT-4’s evaluation results have the highest correlation with human judgments, with a τ value of 0.660. However, it is important to note that GPT-4’s ratings, while highly correlated, are relative and not easy to scale for evaluating newly introduced models. Among the absolute automatic metrics, BERT-score emerged as the most reliable indicator, demonstrating a τ value of 0.538. Consequently, we advocate the use of BERT-score as the benchmark metric for assessing the rationale capabilities of newly introduced LLMs on MMedBench in subsequent studies.

Fig. 4: Comparative analysis on model ratings.

a Score bars represent ranked scores under different metrics. BLEU score rating denotes the rating score calculated from the ranking by BLEU score. Human rating refers to rankings provided by humans, while GPT-4 rating refers to rankings generated by GPT-4. b The fitted lines present the correlation between human rating results and different automatic metrics. τ is the Kendall rank correlation coefficient and k is the slope of the fitted line. Source data are provided as a Source Data file.

Evaluation on public English benchmarks

Here, we incorporate additional English instructions (from PMC-LLaMA3) into MMed-Llama 3 fine-tuning, and present comparisons with other existing LLMs on English-only benchmarks. Specifically, we use four widely adopted multiple-choice question-answering benchmarks, namely, MedQA, MedMCQA, PubMedQA and MMLU (Massive Multitask Language Understanding)-Medical2,12,14,15. The details of these benchmarks can be found in the Methods (“English benchmark evaluation”). Roughly speaking, MedQA and MedMCQA are clinical exams, mainly assessing diagnosis and treatment ability; PubMedQA focuses on biomedical academic question answering; and MMLU-Medical is a medical sub-split of MMLU, targeting basic knowledge of different medical concepts.

As shown in Table 3, MMed-Llama 3 demonstrates state-of-the-art performance on English benchmarks; specifically, we obtain performance gains of 4.5%, 4.3% and 2.2% on MedQA, MedMCQA and PubMedQA, respectively. Similarly, on MMLU, our model achieves the best performance on most subjects among the open-source LLMs, even surpassing the strong GPT-3.5 significantly, e.g., 72.59 vs. 67.69.

Table 3 Multiple-choice accuracy evaluation on various English multiple-choice question-answering benchmarks

Ablation studies on data composition

We present an analysis of the effects of the dataset construction process, as depicted in Table 4. Our ablation studies are carried out on MMedLM, MMedLM 2 and MMed-Llama 3 under the full fine-tuning setting, leveraging InternLM, InternLM 2 and Llama 3 as base models. Overall, the results observed on the three models are largely consistent; in the following, we thus focus our discussion on MMed-Llama 3.

Table 4 Ablation study on MMedBench

Here, we distinguish HQ-Data (High-Quality Data) and US-Data (Unspecified Source Data). HQ-Data includes content sourced from books and websites, which has undergone thorough human verification, whereas US-Data is derived from filtering medical-related content from a general corpus. The outcomes, detailed in Table 4, reveal that equipping the model with comprehensive rationales results in an average multiple-choice accuracy increase of 4.07 points, elevating from 58.72 to 62.79. However, further auto-regressive training exclusively on the English segment of MMedC does not yield an overall accuracy improvement. We conjecture this is due to overfitting on English, leading to superior performance in English but inferior results in other languages (see Supplementary Material E for more details). When expanding the auto-regressive training to the entire multilingual medical corpus, the problem is largely alleviated, significantly improving the final results: not only does the choice accuracy rise to 64.40, but reasoning capabilities are also enhanced by 0.48 and 0.54 points on BLEU-1 and ROUGE-1, respectively. Moreover, the inclusion of the automatically gathered US-Data facilitates an additional accuracy boost from 64.40 to 67.75, a significant increase of 3.35 points. Performance gains can also be observed in rationale ability, i.e., 0.29 in BLEU-1 and 0.16 in ROUGE-1.

Discussion

In this section, we will first highlight the main empirical conclusions from our experimental results, followed by the potential impact of this work, and finally the existing limitations.

Experimental results

From our experimental results, we can draw the following critical conclusions.

First, auto-regressive training on MMedC is effective. As revealed in Table 1, MMedLM, MMedLM 2 and MMed-Llama 3 all demonstrate significant improvement over their original baseline models, namely, InternLM, InternLM 2 and Llama 3, underscoring the effectiveness of MMedC in providing targeted domain-specific knowledge. In addition, the observed performance boosts indicate that the pre-training corpora of existing LLMs exhibit limitations when faced with multilingual medical contexts. Our findings reinforce the necessity of specialized corpora like MMedC to bridge these gaps.

Second, incorporating more data is generally effective. While exploring how varying data sources affect language model performance, our findings, presented in Table 4, reveal that the inclusion of high-quality multilingual data (HQ-Data) leads to significant performance improvements. Additionally, we observe that incorporating data filtered from the general language corpus, despite its relatively lower quality compared to more explicitly medical-related sources, is also effective. This improvement underscores the value of integrating diverse data types within MMedC.

Third, incorporating rationales for fine-tuning is effective. While fine-tuning on the MMedBench training set, we observed that integrating rationale data with multiple-choice prediction enhances performance on specific tasks. As shown in Table 4, combining correct answers with their rationales during the supervised fine-tuning phase not only enables LLMs to output rationale sentences, but also results in a noteworthy multiple-choice accuracy improvement of 2.33% for InternLM, 2.42% for InternLM 2 and 4.07% for Llama 3 on the MMedBench test set. This indicates that the two tasks are strongly correlated and reinforces the significance of training on multi-choice prediction and rationale tasks jointly.

Fourth, strong foundational LLMs improve the final results. On MMedBench, we also notice that stronger LLM backbones (commonly released later) generally improve the final results on multilingual medical QA. With the release of more advanced LLMs, their pre-training corpora have been expanded significantly, gradually encompassing more languages. Even though non-English languages constitute a small fraction of the total, the sheer volume of the overall corpus allows the models to encounter a vast array of multilingual texts during training, significantly enhancing their multilingual capabilities, as seen in the comparison between Llama 2, Mistral and Llama 3, where the later models all perform much better than the earlier one. Such enhancement in general multilingual language ability also improves performance after adaptation to the medical domain (MMedLM vs. MMedLM 2 vs. MMed-Llama 3). This observation shows that we should focus more on building open-source datasets for medicine, which allows future work to better leverage the rapid improvement of general LLMs.

Research impacts

Moreover, by initiating the development of multilingual medical LLMs, our work can promote the following critical research directions:

Promote General Medical Artificial Intelligence (GMAI) development. GMAI16 aims to develop multimodal AI models that can be directly applied to a wide range of healthcare scenarios, where LLMs are often used as a human-machine interface17,18,19. Replacing the English-centric LLM with a multilingual one makes it possible to exploit worldwide data sources, thus expanding the available multimodal training data and improving the representation quality for other modalities.

Improve retrieval-augmented generation. Hallucination is considered a major problem of existing LLMs, especially in the medical domain. One potential solution is to develop retrieval-augmented architectures20,21,22. The key motivation is that, by retrieving facts from an external knowledge base, the generated outputs from LLMs can avoid most fatal factual errors. However, until now, most efforts have been made in English, greatly limiting retrieval-augmented methods from leveraging medical knowledge in other languages. Developing multilingual LLMs can benefit the retrieval process, greatly enriching the potential available knowledge base.

Clinical impacts

Beyond research impacts, in clinical practice, open-source multilingual medical LLMs can also meet the following demands.

Ease the language barrier. In many healthcare systems, language barriers between patients and healthcare providers can hinder effective communication, leading to misunderstandings, misdiagnoses, and inadequate care, and rendering high-quality medical resources inaccessible to many people. Multilingual medical LLMs can facilitate real-time translation and interpretation, ensuring that patients can effectively communicate their symptoms and understand their diagnoses and treatment options.

Reduce cultural and legal insensitivity. Multilingual medical LLMs can also be trained to recognize and address cultural or legal nuances and sensitivities of different countries in healthcare interactions. Understanding cultural backgrounds and legal differences can significantly enhance trust in medical LLMs, leading to better health outcomes.

Help medical education. These models can also be customized for education, especially in regions where there is a shortage of medical educators or resources. By providing educational materials and simulations in multiple languages, medical multilingual LLMs can help standardize medical training and ensure consistent quality of care worldwide.

Potential limitations

While our work primarily focuses on constructing a multilingual medical corpus and enhancing the capabilities of LLMs for medicine across various languages, we encountered certain limitations.

First, given that a significant portion of our data is acquired through web crawling, it is inevitable that the corpus may contain inherent biases against certain underprivileged populations. This is a critical challenge in the development of medical LLMs, as highlighted in previous research23. In the future, we will explore more stringent and comprehensive safety controls for potential bias.

Second, on explainability, although we strive to enhance the model with rationale capabilities to help users understand its final decisions, developing explainability techniques for LLM architectures, analogous to those utilized for convolutional blocks or MLPs24, remains under-explored.

Third, the languages in this dataset do not cover the entire world population. In the future, we anticipate expanding to more languages, for example, German and Arabic. Specifically, the CommonCrawl datasets25 comprise over 167 languages, and with our filtering pipeline, we can efficiently extract medical-related content by defining specific filtering seed words. In addition, in numerous languages, medical literature is available to support local medical education, and integrating these resources into our approach can further enrich the training corpus. Moreover, as general LLMs become increasingly robust, although they may not accurately answer medical questions across various languages, they can effectively rewrite reference sentences into alternative formats or translate them into other languages, tasks that are comparatively simpler. This capability can serve as an augmentation strategy to enhance data for extremely low-resource languages.

Lastly, considering the computational costs, our final model is at the 8B scale. In the future, we will extend training to larger architectures with retrieval augmentation, which can potentially achieve better results while alleviating hallucination issues.

Methods

In this part, we provide details on our methodology. Specifically, we first introduce the construction pipeline for MMedC, then describe the auto-regressive training procedure, and finally discuss the new multilingual medical benchmark, MMedBench, including its curation procedure, evaluation settings and metrics.

Large-scale multilingual medical corpus

We herein develop a new large-scale multilingual medical corpus, MMedC, to help enrich LLMs with domain-specific medical knowledge across different languages. In detail, we explore four primary sources, i.e., filtering medical-related content from a general language corpus, medical textbooks, open-source medical websites, and existing small-scale multilingual medical corpora. As a result, MMedC contains over 25B tokens, covering six main languages, i.e., English, Chinese, Japanese, French, Russian, and Spanish. Next, we introduce the data collection process for each of the four sources.

Filtering medical-related content

The first way to obtain medical-related content is to filter existing corpora with heuristic algorithms. In the broader landscape of natural language processing, the community has amassed an extensive array of corpora, such as CommonCrawl, which captures billions of web pages monthly and has been operational for years. Although medical-related content constitutes only a small fraction of this colossal dataset, its sheer volume presents a valuable opportunity for creating a large-scale, medical-specific corpus with the application of sophisticated auto-filtering techniques.

Our methodology begins with the CulturaX dataset26, a meticulously curated multilingual version of CommonCrawl containing 6.3 trillion tokens. We introduce a rule-based filtering pipeline to sift through this dataset for medical content. This process involves the careful selection of 200 medically relevant terms per language, encompassing fields such as medicine, pharmacy, and medical biology. Given the space limitation of the paper, we list all 1200 terms in our GitHub repository. For languages that use spaces for word separation, our approach applies word segmentation followed by keyword matching; for languages without clear word demarcation, we employ direct keyword matching. Based on the matching results, we establish two principal metrics:

Algorithm 1

Determining Medical-Related Text Samples

Input: Text T, Set of keywords K, Language type Lang

Output: True or False

Define T_C as the threshold for MKC and T_D as the threshold for DENS
if Lang = “Space Delimited” then ▷ split text into words first for space-delimited languages
 Segment T into words based on spaces
end if
Initialize K_U ← ∅
Initialize total keyword length L ← 0
for each word t in T do
 if t ∈ K then
  Increment L by len(t)
  if t ∉ K_U then
   Add t to K_U
  end if
 end if
end for
Calculate MKC ← |K_U| and DENS ← L/len(T)
if MKC > T_C and DENS > T_D then
 return True ▷ text is considered medical-related
else
 return False ▷ text is not considered medical-related
end if

Medical keyword count quantifies the number of unique medical keywords in the text. Let K be the set of a priori keywords representing medical terms of interest, and let T denote the entire text under analysis. The set of unique keywords appearing in the text can be formulated as \({K}_{U}=\{k\mid k\in T\wedge k\in K\}\). The Medical Keyword Count (MKC) is then defined as \({{\mathrm{MKC}}}=| {K}_{U}|\), where \(| \cdot |\) denotes the cardinality of the set.

Keyword density measures the proportion of text occupied by medical keywords relative to the total text length. This metric is instrumental in identifying texts that, despite their length, only incidentally include medical terms. Let len(T) denote the total number of characters in the text T, and occ(t, T) denote the number of occurrences of word t in T. The keyword density, denoted as D (DENS in Algorithm 1), can be formulated as:

$$D=\frac{{\sum }_{k\in K}len(k)\cdot occ(k,T)}{len(T)}$$
(1)

With the two metrics, we simply set threshold bars to filter each sentence. To check the filtering quality, we randomly sampled 100 sentences per language; on average, 98 sentences were manually verified as medical-related. The final thresholds and filtering ratios are detailed in Supplementary Material C.
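To make the filtering rule concrete, the sketch below implements the two metrics in Python; the keyword list and thresholds are illustrative placeholders rather than the exact values used in our pipeline (which are given in Supplementary Material C).

```python
# Minimal sketch of the medical-content filter; keywords/thresholds are illustrative.
def is_medical_related(text, keywords, space_delimited=True,
                       mkc_threshold=2, density_threshold=0.01):
    """Return True if the text passes both the MKC and keyword-density checks."""
    tokens = text.split() if space_delimited else None
    unique_hits, keyword_chars = set(), 0
    for kw in keywords:
        # Word-level match for space-delimited languages, substring match otherwise
        # (e.g., Chinese or Japanese, where no explicit word boundaries exist).
        occ = tokens.count(kw) if space_delimited else text.count(kw)
        if occ > 0:
            unique_hits.add(kw)
            keyword_chars += len(kw) * occ
    mkc = len(unique_hits)                       # Medical Keyword Count (MKC)
    density = keyword_chars / max(len(text), 1)  # keyword density D
    return mkc > mkc_threshold and density > density_threshold

# Toy usage with three illustrative keywords.
keywords = ["diagnosis", "therapy", "pharmacology"]
print(is_medical_related("The diagnosis guides therapy and pharmacology dosing.", keywords))
```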

Medical textbooks

In addition to filtering the general language corpus, we also collect a large number of medical textbooks, which represent a rich repository of medical knowledge, underscored by a rigorous publication process that ensures content quality. We have curated a collection exceeding 20,000 books, in line with the methodology outlined in PMC-LLaMA3. To extract text from the books, we adopt Optical Character Recognition (OCR); specifically, we use the PaddleOCR tool for its proficiency in handling multiple languages. The OCR process generates a list detailing the coordinates and content of each text box, which is then organized in a left-to-right and top-to-bottom order. Furthermore, to ensure a focus on medical content, we exclude non-essential pages, such as covers, tables of contents, and epilogues, identifying them by their page numbers for removal. Quantitatively, we finally collect 4B tokens for English, 1.1B tokens for Chinese, 0.4B tokens for Russian, and 0.3B tokens for French.
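As a rough illustration of the OCR step, the sketch below uses PaddleOCR to extract text boxes from a page image and orders them top-to-bottom, then left-to-right; the exact arguments and the result format may vary across PaddleOCR versions, so this should be read as a sketch rather than our exact pipeline.

```python
# Illustrative OCR extraction and reading-order sorting (PaddleOCR API may vary by version).
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="fr")            # language of the textbook being processed
result = ocr.ocr("page_042.png")      # list of text boxes with recognized text

boxes = []
for box, (text, confidence) in result[0]:
    x_min = min(point[0] for point in box)
    y_min = min(point[1] for point in box)
    boxes.append((y_min, x_min, text))

# Sort top-to-bottom, then left-to-right, and join the boxes into page text.
boxes.sort()
page_text = "\n".join(text for _, _, text in boxes)
print(page_text)
```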

Medical websites

Considering that the filtering-based data come from CommonCrawl, which is randomly scraped and untraceable, we further crawl a number of medical-related websites to compensate, avoiding the omission of important medical knowledge websites. We focus on three types of websites. Firstly, we target medical encyclopedias, which offer detailed information on diseases and drugs. While this data is of exceptional quality, it is often limited in quantity and subject to stringent access controls. Secondly, we source content from medical consultation platforms and popular science articles about medicine. These sources, though less technical, provide a wealth of knowledge on medical common sense. Lastly, we expand our data collection to include medical news websites, which allow us to gather a larger volume of unrestricted data and incorporate timely information into our model. This strategy enhances the model’s ability to understand and respond to current medical events and trends. Collecting data from these varied websites, we compile a comprehensive and diverse medical corpus, encompassing in-depth professional medical knowledge as well as a broad spectrum of general medical information and up-to-date industry insights. As a result, we obtain 0.1B tokens for Japanese, 0.05B tokens for Spanish, and 0.1M tokens for French.

Existing small-scale multilingual medical corpus

Apart from the above newly collected data, we also leverage existing open-source corpora. Specifically, we utilize the following three datasets: Wikipedia27, Baidu Baike28, and the UFAL Medical Corpus29. For Wikipedia and Baidu Baike, we employ the same filtering methodology mentioned before to extract the medical-domain corpus, while UFAL, a medical corpus designed for translation tasks, is used directly.

Auto-regressive training on MMedC

Once MMedC is constructed, we further pre-train existing LLMs on it in an auto-regressive manner, adopting the next-token-prediction loss as used in GPT1. Specifically, we treat medical text as a sequence of tokens, denoted as \(X=\{{x}_{1},{x}_{2},\ldots,{x}_{N}\}\), where each \({x}_{i}\) is a text token and N represents the total length of the sequence. For a token \({x}_{i}\) to be predicted, the optimization objective is:

$$L(\Phi )=-{\sum}_{i}\log \Phi \left({x}_{i}| {x}_{ < i}\right)$$
(2)
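A minimal sketch of this objective with the Hugging Face Transformers interface is shown below; the model identifier is a placeholder, and the shifted cross-entropy corresponding to Eq. (2) is computed internally when labels are supplied.

```python
# Sketch: next-token-prediction loss on a piece of medical text (model name is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "internlm/internlm2-7b"  # assumed base model identifier
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

text = "Aspirin irreversibly inhibits cyclooxygenase, reducing prostaglandin synthesis."
inputs = tokenizer(text, return_tensors="pt")

# Passing input_ids as labels makes the library compute -sum log p(x_i | x_<i),
# averaged over tokens, i.e., the auto-regressive objective of Eq. (2).
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
```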

Comprehensive multilingual medical benchmark

In addition to the multilingual datasets for training, we also construct a comprehensive Multilingual Medical Benchmark spanning the six principal languages, namely MMedBench, to conduct a thorough evaluation of a model’s performance in the medical domain across diverse languages. Specifically, we start by collecting existing medical question-answering (QA) benchmarks for each language, and expand these multi-choice QA pairs with corresponding explanations using GPT-4, followed by strict human verification to ensure the correctness of the content.

Multilingual medical QA dataset

Evaluating the performance of Large Language Models (LLMs) has conventionally relied on the utilization of multiple-choice questions. This evaluation framework presents the question and its corresponding options to the model, which is then expected to identify the correct answer’s index. Accuracy serves as the primary quantitative metric in this method, providing a direct and objective measure of the performance. Despite its efficacy, the prevalent medical multi-choice QA benchmarks are exclusively monolingual, thus falling short of adequately assessing LLMs’ capabilities across diverse languages. To address this deficiency and foster a more inclusive evaluation landscape, our approach involves the aggregation of various medical multi-choice QA datasets from multiple languages. This initiative aims to compile a comprehensive benchmark that reflects the multilingual realities of the medical field. The following benchmarks are considered:

  • MedQA30 is a collection of medical multiple-choice questions (each with four answer options) based on the USMLE examination. It encompasses data in three languages: English, Simplified Chinese, and Traditional Chinese. For our evaluation, we exclusively utilize the English and Simplified Chinese sections. The data is partitioned by official guidelines.

  • IgakuQA31 is a Japanese medical multi-choice question dataset, which comes from Japanese medical licensing examinations from the past five years (2018-2022). Since there is no official data division, we randomly divide the data and get 1,590 training samples, 199 validation samples, and 199 test samples.

  • FrenchMedMCQA32 is a French medical multi-choice question dataset, which comes from real exams of the French medical specialization diploma in pharmacy. The data is divided according to the official release.

  • RuMedDaNet13 is a Russian medical judgment question dataset, which we process into a binary-choice question format. The data is divided according to the official split.

  • Head-QA33 is a Spanish multiple-choice question dataset. Questions come from exams to access a specialized position in the Spanish healthcare system. The data is divided according to the official split.

As a result, we collected 53,566 QA pairs in total. For those datasets without an official definition of train-test sets, we split them 8:1:1 for training, validation, and testing, resulting in 45,048 pairs for training and 8518 for testing.
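For datasets without an official split, the 8:1:1 division can be reproduced along the lines of the following sketch; the random seed is an illustrative choice.

```python
# Sketch: 8:1:1 train/validation/test split for QA pairs lacking an official division.
import random

def split_qa_pairs(qa_pairs, seed=42):
    shuffled = qa_pairs[:]
    random.Random(seed).shuffle(shuffled)          # fixed seed for reproducibility
    n_train, n_val = int(0.8 * len(shuffled)), int(0.1 * len(shuffled))
    return (shuffled[:n_train],                    # training set
            shuffled[n_train:n_train + n_val],     # validation set
            shuffled[n_train + n_val:])            # test set
```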

Rationale generation

While the accuracy of multi-choice question answering is a straightforward and reliable metric, it fails to evaluate the reasoning and long-sentence generation abilities of LLMs, which are critical for clinical usage. Therefore, we further augment each question with a justification for selecting the correct option. At the evaluation phase, we prompt the model to articulate the rationale behind its choice, thereby offering insights into the model’s reasoning capabilities, as shown in Fig. 5.

Fig. 5: The pipeline of MMedBench construction.

First, multi-choice QA pairs in various languages are collected from five QA datasets. Then, corresponding rationales are generated with the help of GPT-4. The rationales of the test set are further checked by humans to ensure their quality.

In detail, given GPT-4’s demonstrated capability to outperform human experts by providing detailed explanations in Chain of Thought (CoT) experiments34, we utilize GPT-4 to generate rationales for our dataset. To guarantee the quality of these explanations, we subsequently perform human verification. Specifically, we input the question, the options, and the correct choice into GPT-4, instructing it to generate a detailed rationale for selecting a particular option. The instructions are as follows, where “{language}” will be replaced by a certain language name, like Chinese or French, in a certain case:

You’re a {language} doctor. Analyze the reasons for choosing this particular option in 100 words for the following question in {language}.
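A sketch of how this prompt could be issued programmatically is given below, assuming the OpenAI chat-completions client; the model identifier and message formatting are illustrative and not necessarily identical to our implementation.

```python
# Illustrative rationale-generation call (model name and formatting are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_rationale(language, question, options, correct_answer):
    system_msg = (f"You're a {language} doctor. Analyze the reasons for choosing this "
                  f"particular option in 100 words for the following question in {language}.")
    user_msg = f"Question: {question}\nOptions: {options}\nCorrect answer: {correct_answer}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system_msg},
                  {"role": "user", "content": user_msg}],
    )
    return response.choices[0].message.content
```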

Following the rationale generation by GPT-4, we conducted a manual review to assess its quality. Our evaluation criteria are two-fold: first, the explanation provided by GPT-4 has to be consistent with the established correct answer for the question; second, it is required to articulate the logic underpinning the answer, rather than merely replicating it. Note that, considering the cost of human checks, this is only performed on part of our test set. Specifically, we randomly selected 200 samples from the former test split for each language to form a new rationale split, and then shuffled and distributed them among three annotators for manual verification. Annotators were tasked with categorizing each rationale as either qualified or unqualified based on the aforementioned criteria. Remarkably, we observed that 94.7% of the rationales generated by GPT-4 adhered to our standards, underscoring the high quality of the explanations. During the final evaluation phase, the calculation of rationale similarity is exclusively applied to these human-verified samples. Finally, we obtain 1136 human-checked samples for rationale evaluation and supplement the 45,048 training QA pairs with auto-generated rationale sentences. A given language model can either be fine-tuned on our training set and then evaluated, or be evaluated directly on the rationale and multiple-choice test sets.

Topic classification

Subsequently, we explore the thematic distribution of the samples. For this purpose, we employ GPT-4 to categorize the topics within the test set. The instruction provided to GPT-4 for topic classification is outlined as follows, where, similarly, “{language}” will be replaced by a certain language name:

You’re a {language} doctor, choose one subject out of {medical_subjects_string}, which is most relevant to the following question.

At times, GPT-4 may yield ambiguous classification outcomes, such as those not aligning with the predefined medical subjects. In these instances, we prompt GPT-4 to retry the classification up to four times. Should it fail to produce a categorization that adheres to our criteria, we assign the sample’s category as ‘None’. Given the rarity of such occurrences, their impact on the overall statistics is minimal.
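The retry-and-fallback logic can be summarized by the sketch below, where classify_with_gpt4 is a hypothetical helper wrapping the classification prompt above and the subject list is truncated for brevity.

```python
# Sketch of the topic-classification retry loop ('classify_with_gpt4' is hypothetical).
MEDICAL_SUBJECTS = {"Internal Medicine", "Biochemistry", "Pharmacology", "Psychiatry"}  # truncated

def classify_topic(question, classify_with_gpt4, max_attempts=4):
    for _ in range(max_attempts):
        subject = classify_with_gpt4(question)
        if subject in MEDICAL_SUBJECTS:
            return subject
    return "None"  # rare fallback when no valid subject is returned
```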

Evaluation settings

To comprehensively assess the models’ performance, we test them in three different evaluation settings: zero-shot, parameter-efficient fine-tuning (PEFT), and full fine-tuning. For the zero-shot setting, we directly test off-the-shelf LLMs with proper instructions, without any further exposure to the training part of MMedBench. In addition to zero-shot, to better evaluate the performance differences between models, we also fine-tune and then test the open-source models. There are two commonly used fine-tuning approaches: parameter-efficient fine-tuning (PEFT) and full fine-tuning. In the former, only a small subset of the model’s parameters is trainable, representing performance in a scenario with limited computing resources; here we adopt the most representative PEFT method, LoRA35. In the latter, all parameters are fine-tuned, which is the more conventional practice.

Next, we will introduce the baseline LLMs considered in our work for comparison:

  • GPT-41, a groundbreaking multilingual large language model developed by OpenAI, stands as one of the most sophisticated LLMs to date. Due to the confidentiality of data and model details, its exact model scale is unknown. Though GPT-4 is not advertised as a multilingual LLM, its multilingual ability is nevertheless strong. Given that it is only accessible through an API, we evaluate it in a zero-shot manner with proper instructions (see Supplementary Material A for more details).

  • GPT-4 (5-shot, CoT) uses in-context learning36 and chain-of-thought37 prompting to further improve the performance of GPT-4, representing the highest performance currently achievable. For implementation, we follow the prompts used in MedPrompt14. Notice that, although this approach can enhance the performance of different LLMs, it takes up more tokens and results in additional costs.

  • GPT-3.538 is also developed by OpenAI. Similar to GPT-4, its detailed model size and training data composition are unknown, and it is not explicitly described as a multilingual or monolingual LLM, but it also performs well on multilingual input. As the predecessor to GPT-4, it continues to exhibit robust performance for everyday applications and remains extensively utilized. We assess GPT-3.5 using the same API-based methodology as applied to GPT-4.

  • Flan-PaLM15 and MedPaLM 22 are two closed-source multilingual biomedical LLMs developed by Google. They demonstrate strong performance on English medical multiple-choice question answering. However, since they provide neither model weights nor API access, we can only compare with them on the widely used English benchmarks. A well-known variant of Flan-PaLM is MedPaLM15; however, the original paper does not report its multiple-choice question-answering accuracy, so we can only compare with Flan-PaLM.

  • Gemini-1.0 pro39 is the latest general multimodal foundation model developed by Google. Though it targets multimodal scenarios, as reported in the original paper, its language ability even surpasses Google’s former LLM, PaLM 212. Similar to the GPT series, its detailed scale and whether it specifically targets multilingual or monolingual scenarios are not released; however, in our testing, it responds well to multilingual input.

  • BLOOM7, an early open-source, multilingual LLM family, undergoes pre-training with a diverse range of language corpora. We select the 7B parameter variant for our studies, employing a fine-tuning evaluation approach.

  • MedAlpaca4 is a specialized open-source monolingual medical LLM, fine-tuned from LLaMA on a dataset of over 160,000 English medical entries.

  • ChatDoctor5 is a monolingual medical LLM based on LLaMA and further fine-tuned on 100,000 real-world patient-doctor dialogues in English, marking it as a distinct medical LLM. We employ the 7B parameter model, applying a fine-tuning evaluation framework.

  • PMC-LLaMA3 presents another open-source monolingual medical-specific LLM, pre-trained exclusively on English medical literature, including papers and books. We utilize the 7B parameter version for evaluation.

  • Llama 2 and Llama 340 belong to the Llama series of open-source LLMs developed by Meta. Llama 2 is the previous generation in the series and Llama 3 is the latest one. The Llama models are acknowledged as among the most powerful open-source monolingual English LLMs of their respective release periods. While primarily trained for English, their vocabularies encompass tokens for other languages as well. Given their substantial pre-training data, which may include samples from other languages, Llama models can also exhibit promising performance in multilingual scenarios. We engage the 7B parameter model for Llama 2 and the 8B parameter model for Llama 3 in our evaluation process.

  • Mistral 7B41, released in October 2023, is an innovative open-source monolingual LLM that claims superiority over Llama 2 13B across all assessed benchmarks. We adopt a fine-tuning evaluation methodology for this model.

  • InternLM and InternLM 28, developed by Shanghai AI Lab, are among the leading open-source multilingual LLMs. InternLM was released in July 2023 and InternLM 2 was released in February 2024. For both models, we select the 7B parameter variant and implement a fine-tuning evaluation strategy.

  • MEDITRON6, released in November 2023, is an open-source monolingual biomedical LLM that leverages an extra 45B English tokens to further pre-train the general LLM Llama 2. It has two scaled versions, i.e., 7B and 70B; for fair comparison with others, we mainly adopt the 7B version.

  • BioMistral42, released in February 2024, is an open-source multilingual biomedical LLM based on Mistral. It is concurrent with ours and is also targeted at the multilingual biomedical domain. We compare with it as a strong baseline.

  • Gemma43, released in March 2024, is an open-source monolingual LLM developed by Google DeepMind, targeting English. It demonstrates strong performance across academic benchmarks for language understanding, reasoning, and safety. It has two versions, i.e., 2B and 7B; similarly, for a fair comparison, we adopt the 7B one herein.

More detailed information on each model is provided in Supplementary Material B.

Metrics and human rating

In this part, we introduce the evaluation metrics and human rating criteria used in our work. To evaluate the performance of LLMs, we employ two measures: accuracy and rationale similarity. Measuring accuracy is straightforward, as the LLM can generate outputs following a specific template. However, assessing rationale similarity presents a more complex challenge, typical within the NLP domain. Initially, we applied three classical text similarity metrics: BLEU9, ROUGE10, and BERT-score11.

BLEU

quantifies the match between a model’s output and a reference, focusing on the precision of n-grams. BLEU is calculated as follows:

$$\,{{\mbox{BLEU}}}={{\mbox{BP}}}\,\cdot \exp \left({\sum}_{n=1}^{N}{w}_{n}\log {P}_{n}\right)$$
(3)

where \({P}_{n}\) is the precision of n-grams, \({w}_{n}\) is the weight for each n-gram size, and BP is the brevity penalty. N typically equals 4 in most applications. For BLEU-n, the evaluation focuses only on n-grams of that specific length, achieved by setting \({w}_{n}=1\) for that particular n and all other weights to 0. In the standard BLEU calculation, a weighted average of BLEU-1 through BLEU-4 scores is used, with each component typically having equal weight (\({w}_{1}={w}_{2}={w}_{3}={w}_{4}=0.25\)).
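For illustration, the sketch below implements Eq. (3) directly for a single candidate-reference pair; in practice standard toolkits are used, so this is only a minimal reference implementation.

```python
# Minimal BLEU sketch following Eq. (3): modified n-gram precision plus brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        p_n = overlap / max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(p_n) if p_n > 0 else float("-inf"))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    weights = [1.0 / max_n] * max_n
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

# BLEU-1 (max_n=1), as reported in Table 2: unigram precision with a brevity penalty.
print(bleu("the patient has a fever", "the patient has high fever", max_n=1))  # 0.8
```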

ROUGE

is a metric that also focuses on n-grams but uniquely incorporates both Recall and Precision in its calculation. ROUGE is computed as follows:

$$\,{{\mbox{ROUGE}}}\,=\frac{2\times {P}_{n}\times {R}_{n}}{{P}_{n}+{R}_{n}}$$
(4)

where \({P}_{n}\) and \({R}_{n}\) represent the Precision and Recall of n-grams, respectively. Note that ROUGE-N emphasizes the Precision and Recall of n-grams, whereas ROUGE-L calculates the Precision and Recall based on the longest common subsequence (LCS).
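Analogously, a minimal ROUGE-1 computation (Eq. (4) with unigrams) can be sketched as follows.

```python
# Minimal ROUGE-1 sketch: unigram precision and recall combined into an F1 score.
from collections import Counter

def rouge_1(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_1("the patient has a fever", "the patient has high fever"))  # 0.8
```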

BERT-score

utilizes the contextual embedding from a pre-trained BERT to capture high-level semantic features, calculating the similarity between reference and candidate texts through cosine similarity. The Recall is calculated as follows:

$$R=\frac{{\sum }_{{x}_{i}\in x}{{\mathrm{idf}}}\left({x}_{i}\right){\max }_{{\hat{x}}_{j}\in \hat{x}}{{{\bf{x}}}}_{i}^{\top }{\hat{{{\bf{x}}}}}_{j}}{{\sum }_{{x}_{i}\in x}{{\mathrm{idf}}}\left({x}_{i}\right)}$$
(5)

where idf denotes inverse document frequency, enhancing the metric’s sensitivity to rare but significant words. Here, \({{{\bf{x}}}}_{i}\) and \({\hat{{{\bf{x}}}}}_{j}\) represent the embeddings of the ith token in the reference text and the jth token in the candidate text, respectively, so that the formula above corresponds to Recall. Precision is computed analogously, and the F1 score is subsequently derived. In this paper, we employ a pre-trained multilingual BERT model to extract features without conducting baseline rescaling.
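In practice, BERT-score can be computed with the open-source bert-score package, roughly as below; the multilingual backbone identifier is an assumed choice and baseline rescaling is disabled, matching the setup described above.

```python
# Sketch: BERT-score with a multilingual backbone, without baseline rescaling.
from bert_score import score

candidates = ["Aspirin inhibits cyclooxygenase, reducing prostaglandin synthesis."]
references = ["Aspirin blocks cyclooxygenase and thus lowers prostaglandin production."]

P, R, F1 = score(candidates, references,
                 model_type="bert-base-multilingual-cased",  # assumed multilingual BERT
                 rescale_with_baseline=False)
print(F1.mean().item())
```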

Relative rating scores

aim to rank the outputs based on relative comparison. In detail, we select six representative models and sample 50 cases for each language. In human rating, for each case, the question, options, correct answer and rationale generated by each model are presented to annotators, along with the reference rationale. The annotators are asked to rank the model-generated rationales based on the following three evaluation criteria:

  • Accuracy. The model’s ability to correctly select the answer.

  • Reasoning Ability. The model’s capacity to demonstrate logical reasoning in making its selection. The model should go beyond merely repeating the question or options, supporting its choice with reasonable reasoning.

  • Integration of Internal Knowledge. The model needs to effectively blend and utilize its internal knowledge base, providing relevant and persuasive factual evidence to support its answer.

Considering that GPT-4 has achieved near-human performance in many aspects, we use GPT-4 to rank the models in the same way, with carefully designed instructions following44. Similarly, for BLEU scores, we can also rank models by comparing the absolute metric values.

Then, for all ranking results, i.e., human rating, BLEU score rating and GPT-4 rating, scores are assigned in reverse order of rank, for example, the top rank receiving a score of 6 and the bottom rank a score of 1, thereby relatively quantifying each model’s output quality.
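The rank-to-score conversion and the agreement between two rankings (via Kendall's τ) can be sketched as follows; the rankings shown are toy values and scipy is used purely for illustration.

```python
# Sketch: reverse-rank scoring of six models and Kendall's tau between two rankings.
from scipy.stats import kendalltau

def ranks_to_scores(ranking, n_models=6):
    # ranking[i] is the rank of model i (1 = best); rank 1 receives score 6, rank 6 receives 1.
    return [n_models + 1 - r for r in ranking]

human_ranks = [1, 3, 2, 5, 4, 6]    # toy human ranking of six models
metric_ranks = [2, 3, 1, 4, 5, 6]   # toy ranking induced by an automatic metric
tau, _ = kendalltau(ranks_to_scores(human_ranks), ranks_to_scores(metric_ranks))
print(tau)
```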

English benchmark evaluation

Here, we describe how we compare the English performance of our model with other existing models.

In assessing the capabilities of large language models in the medical field, we utilize four widely recognized multiple-choice question-answering benchmarks, as follows:

  • MedQA30 is the same dataset as introduced in MMedBench. It is a widely used and highly credible benchmark for assessing the medical ability of models; thus, we re-use it in the English-only evaluation.

  • PubMedQA45 is an English medical question-answering dataset based on PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe, which can also be viewed as a closed-domain multiple-choice question-answering problem. Officially, it is split into three subsets: 1K manually labeled pairs (PQA-L), 61.2K unlabeled pairs (PQA-U), and 211.3K artificially generated pairs (PQA-A). Following existing works46, we adopt PQA-L as the test set so that our results can be compared with others directly.

  • MedMCQA47 is a large-scale English multiple-choice question-answering dataset. It contains more than 194k high-quality AIIMS & NEET PG entrance exam multiple-choice questions, covering 2.4k healthcare topics and 21 medical subjects, with an average token length of 12.77 and high topical diversity. The official train split contains 182,822 questions and the test split contains 4183 questions. Each question has four choices. We adopt the official test split to evaluate our model.

  • MMLU-Medicine48 is the medical portion of MMLU, a comprehensive English benchmark of exam questions spanning 57 subjects, aiming to assess the ability of language models across different domains. Following MedPaLM 22, we adopt the six subjects related to medicine, i.e., anatomy (An), clinical knowledge (CK), college biology (CB), college medicine (CM), professional medicine (PM) and medical genetics (MG), featuring 1089 questions. We adopt the official split of MMLU for testing.

In English, LLMs may use a mixture of supervised data to further align the model with human semantic instructions after pre-training, commonly referred to as instruction tuning49,50,51. This setup is similar to fine-tuning, but the difference is that instruction tuning often involves designing semantic instructions to aggregate a large number of tasks, rather than considering just a few downstream tasks to be tested. Different LLMs may use different dataset collections for instruction tuning. Thus, on English benchmarks it is hard to control the tuning data as we do in the fine-tuning setting on MMedBench; instead, models are compared directly on the unexposed test sets regardless of what tuning data they used. In our case, to enable a fair comparison with existing models, we incorporate an off-the-shelf English instruction fine-tuning dataset (from PMC-LLaMA3) into MMed-Llama 3 fine-tuning.

Implementation details

In this section, we delve into the specifics of auto-regressive training and fine-tuning. We conduct all our experiments using the PyTorch framework and the Transformers Python package.

Auto-regressive training

During further auto-regressive training on MMedC, our optimization objective aligns with that of the auto-regressive generation task. For data processing, we segment the text into chunks, each comprising 2048 tokens, with an overlapping margin of 512 tokens. Throughout the training, we maintain a maximum context length of 2048 tokens. Owing to the model’s extensive parameter count, which precludes fitting on a single GPU, we employ the Fully Sharded Data Parallel (FSDP) strategy to distribute the model across multiple GPUs. Additionally, we utilize the BF16 data type and gradient checkpointing techniques to optimize memory usage. For InternLM, we establish a global batch size of 512 and a learning rate of 2e-5. In the case of BLOOM, we set the global batch size to 512 and the learning rate to 8e-6. We pre-train both models on eight A100 GPUs, adapting gradient accumulation steps to sustain such a large global batch size. We train the 7B model for 20k iterations, which takes about 20 days.
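The chunking described above can be reproduced roughly as in the sketch below, operating on already-tokenized documents.

```python
# Sketch: split a tokenized document into 2048-token chunks with a 512-token overlap.
def chunk_token_ids(token_ids, chunk_size=2048, overlap=512):
    stride = chunk_size - overlap
    chunks, start = [], 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last chunk already reaches the end of the document
        start += stride
    return chunks

# Example: a 5000-token document yields chunks of length 2048, 2048 and 1928.
print([len(c) for c in chunk_token_ids(list(range(5000)))])
```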

Fine-tuning

During the fine-tuning process, our optimization objective remains consistent with the auto-regressive training phase. We set the maximum sequence length to 2048, padding each batch to match the longest sequence in that batch. For full-model fine-tuning, we utilize Fully Sharded Data Parallel (FSDP), the BF16 data type, and gradient checkpointing technology. We set the global batch size to 128 and the learning rate to 1e-6. For LoRA, we use the default recommended rank of 16 with a training setting similar to full fine-tuning.
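For the PEFT setting, a rough LoRA configuration using the peft library might look as follows; the base model identifier and target modules are assumptions, while the rank of 16 matches the setting above.

```python
# Sketch: wrapping a base LLM with rank-16 LoRA adapters via the peft library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder
lora_config = LoraConfig(
    r=16,                                  # LoRA rank used in our PEFT experiments
    lora_alpha=32,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the small adapter subset is trainable
```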

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.