A generalist vision–language foundation model for diverse biomedical tasks

Abstract

Traditional biomedical artificial intelligence (AI) models, designed for specific tasks or modalities, often exhibit limited flexibility in real-world deployment and struggle to utilize holistic information. Generalist AI holds the potential to address these limitations due to its versatility in interpreting different data types and generating tailored outputs for diverse needs. However, existing biomedical generalist AI solutions are typically heavyweight and closed source to researchers, practitioners and patients. Here, we describe BiomedGPT, the first open-source and lightweight vision–language foundation model, designed as a generalist capable of performing various biomedical tasks. BiomedGPT achieved state-of-the-art results in 16 out of 25 experiments while maintaining a computing-friendly model scale. We also conducted human evaluations to assess the capabilities of BiomedGPT in radiology visual question answering, report generation and summarization. BiomedGPT exhibits robust prediction ability with a low error rate of 3.8% in question answering, satisfactory performance with an error rate of 8.3% in writing complex radiology reports, and competitive summarization ability with a nearly equivalent preference score to human experts. Our method demonstrates that effective training with diverse data can lead to more practical biomedical AI for improving diagnosis and workflow efficiency.

Fig. 1: BiomedGPT can process diverse modalities and perform versatile tasks.
Fig. 2: An overview of BiomedGPT: workflow, performance and pretraining datasets.
Fig. 3: BiomedGPT performs fine-tuning for vision–language and medical-image-classification downstream tasks.
Fig. 4: BiomedGPT performs few-epoch transfer learning for clinical-text understanding and summarization and generates a response through zero-shot transfer learning.
Fig. 5: Human evaluation of the VQA, text-summarization and captioning tasks.
Fig. 6: Results of the ablation study on the impact of diversity of pretraining datasets and tasks and a graphical demonstration of BiomedGPT’s design.

Data availability

All data in this study are publicly available and can be accessed from: IU X-ray and Peir Gross (https://github.com/nlpaueb/bioCaption), MedICat (https://github.com/allenai/medicat), PathVQA (https://huggingface.co/datasets/flaviagiammarino/path-vqa), SLAKE 1.0 (https://www.med-vqa.com/slake/), DeepLesion (https://nihcc.app.box.com/v/DeepLesion), OIA-DDR (https://github.com/nkicsl/OIA), CheXpert-v1.0-small (https://www.kaggle.com/datasets/willarevalo/chexpert-v10-small), CytoImageNet (https://www.kaggle.com/datasets/stanleyhua/cytoimagenet), ISIC 2020 (https://challenge2020.isic-archive.com), Retinal Fundus (https://www.kaggle.com/c/diabetic-retinopathy-detection), MIMIC-III Clinical Notes (https://paperswithcode.com/dataset/hospital-admission-notes-from-mimic-iii), NCBI BioNLP (https://www.ncbi.nlm.nih.gov/research/bionlp/Data/), PubMed abstracts derived from the BLUE benchmark (https://github.com/ncbi-nlp/BLUE_Benchmark), VQA-RAD (https://osf.io/89kps/), CBIS-DDSM (https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset), SZ-CXR and MC-CXR (access can be requested via the contact at http://archive.nlm.nih.gov/repos/chestImages.php), MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.1.0/), MedNLI (https://physionet.org/content/mednli/1.0.0/), TREC 2022 (https://www.trec-cds.org/2022.html), SEER (https://seer.cancer.gov), MIMIC-III (https://physionet.org/content/mimiciii/1.4/), HealthcareMagic (https://huggingface.co/datasets/UCSD26/medical_dialog), MeQSum (https://huggingface.co/datasets/sumedh/MeQSum), MedMNIST v2 (https://medmnist.com) and ROCO (https://github.com/razorx89/roco-dataset). A randomly sampled subset of the RSNA Pneumonia Detection Challenge (2018) dataset was used for zero-shot prediction (https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018). MedMNIST-Raw is curated from multiple sources, including NCT-CRC-HE-100K (colon pathology) (https://zenodo.org/records/1214456), HAM10000 (dermoscopy) (https://github.com/ptschandl/HAM10000_dataset), OCT and Chest X-ray (https://data.mendeley.com/datasets/rscbjbr9sj/3), breast ultrasound (https://scholar.cu.edu.eg/Dataset_BUSI.zip), blood cell microscopy (https://data.mendeley.com/datasets/snkd93bnjr/1) and the Liver Tumor Segmentation Benchmark (LiTS) (https://competitions.codalab.org/competitions/17094). The VQA data for human evaluation are derived from Medical-Diff-VQA (https://physionet.org/content/medical-diff-vqa/1.0.0/), with the exclusion of questions related to differences, as these require a two-image input. Report generation and summarization samples for human evaluation are extracted from MIMIC-CXR. The instruction-following data used in this article are derived from PubMed (https://pubmed.ncbi.nlm.nih.gov) following the LLaVA-Med approach (https://github.com/microsoft/LLaVA-Med/blob/main/download_data.sh) and are combined with the training sets of PathVQA and SLAKE. A table with further details of the major datasets is provided in Extended Data Table 2.
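As an illustration only, the following minimal Python sketch shows one way to pull a publicly hosted dataset listed above with the Hugging Face `datasets` library; the dataset ID is taken from the PathVQA link in this section, but the split and field names used here are assumptions that should be checked against the dataset card.

```python
# Minimal sketch of loading one of the listed public datasets with Hugging Face `datasets`.
# The dataset ID comes from the PathVQA link above; split/field names are assumptions.
from datasets import load_dataset

pathvqa = load_dataset("flaviagiammarino/path-vqa")
print(pathvqa)                                     # inspect the available splits
sample = pathvqa["train"][0]                       # 'train' split assumed
print(sample["question"], "->", sample["answer"])  # field names assumed from the dataset card
```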

Code availability

The pretrained and fine-tuned models, as well as source code for training, inference and data preprocessing, can be accessed at https://github.com/taokz/BiomedGPT.


Acknowledgements

NSF grant CRII-2246067, NSF POSE: Phase II-2346158 and Lehigh Grant FRGS00011497 supported L.S., K.Z., Z.Y. and Y.L. NIH grant R21EY034179, NSF grants NCS-2319451, MRI-2215789 and IIS-1909879, as well as Lehigh’s Accelerator and CORE grants S00010293 and S001250, supported L.H. and R.Z. NIH grants R01HL159183 and RF1AG057892 supported Q.L. NIH grant R03AG078625 supported X.L. NIH grants R01EB19403 and R01LM11934 supported S.F. and H.L. Icons used in Fig. 2 were made by Freepik, surang, Smartline and Blackonion02 at www.flaticon.com.

Author information

Authors and Affiliations

Authors

Contributions

K.Z. and L.S. designed the study. K.Z., R.Z. and E.A. carried out data collection, data preprocessing, model construction and model validation. J.Y., Z.Y., Y.L. and Z.L. carried out the analysis of the benchmarking results. X.C., B.D.D., J.H., C.C., Y.Z., S.F., W.L., T.L., X.L., Y.C., L.H., J.Z., Q.L. and H.L. provided knowledge support and interpreted the findings. H.R. carried out the human evaluation of the text generated by BiomedGPT and GPT-4V. L.S. provided knowledge support, interpreted the findings and supervised the study. All authors contributed to manuscript writing and reviewed and approved the final version. L.H., X.L. and L.S. co-supervised the study.

Corresponding authors

Correspondence to Xiang Li, Lifang He or Lichao Sun.

Ethics declarations

Competing interests

The research was conducted independently of any commercial or financial relationships that could be construed as a potential conflict of interest. Although X.C. is employed by Samsung, the company was not involved in any aspect of this research. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Statistics of pretraining and fine-tuning datasets.

(a) Modality distribution of the pretraining data used in BiomedGPT. (b) Training and testing splits of the datasets used in downstream fine-tuning; each dataset is typically described in the format ‘number of training samples/number of validation samples/number of test samples’. More details of the data splits are described in Supplementary Table 7.

Extended Data Fig. 2 Overview of BiomedGPT’s model configuration and architecture.

(a) Detailed model configuration of BiomedGPT. Here, ‘#’ denotes ‘number of’; ‘Att.’, ‘Enc.’ and ‘Dec.’ denote attention, encoder and decoder, respectively. The hidden size is the size of the embeddings and of the output of each self-attention and feed-forward layer. The first layer of the FFN expands the hidden size to the intermediate size, and the second layer contracts it back to the hidden size; this expansion and contraction allows the network to create more complex representations. During the pretraining phase, image processing involves resizing and cropping the images to varying resolutions, corresponding to the input sizes listed in the table. Note that during the fine-tuning and inference stages, the input resolution of BiomedGPT can be flexibly adjusted according to the specific requirements of the task. (b) The neural-network architecture of BiomedGPT, which comprises bidirectional encoder blocks and autoregressive decoder blocks. The number of blocks varies across model scales.
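To make the hidden-size/intermediate-size relationship concrete, here is a minimal PyTorch sketch of the feed-forward expansion and contraction described above. It is not the released BiomedGPT code; the sizes (768 and 3072) and the GELU activation are illustrative placeholders.

```python
# Minimal sketch (not the released BiomedGPT code) of the FFN expansion/contraction
# described above; hidden_size, intermediate_size and the activation are placeholders.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072):
        super().__init__()
        self.expand = nn.Linear(hidden_size, intermediate_size)    # hidden -> intermediate
        self.contract = nn.Linear(intermediate_size, hidden_size)  # intermediate -> hidden
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expanding lets the layer form richer intermediate representations
        # before projecting back to the residual stream's hidden size.
        return self.contract(self.act(self.expand(x)))

x = torch.randn(2, 16, 768)    # (batch, tokens, hidden)
print(FeedForward()(x).shape)  # torch.Size([2, 16, 768])
```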

Extended Data Fig. 3 The graphical illustrations of the key components in BiomedGPT.

(a) Head-scale multi-head attention module in BiomedGPT. A trainable parameter γ_h is applied prior to the output projection of each head. (b) Instead of adding the absolute positional embedding P_i to the input embedding I_i (left), we compute the positional correlation and the input correlation separately with different projection matrices and add them together in the self-attention module (right). (c) Graphical illustration of the relative position bias. This inductive bias B_{j−i} is a learnable parameter that can be viewed as the embedding of the relative position j−i; it is injected into the query–key product, \(\frac{1}{\sqrt{d}}(I_i W^{Q})(P_i W^{K}) + B_{j-i}\), and is shared across all layers. (d) An example of trie-based beam search: along the path across ‘Lipid’ and ‘breakdown’, BiomedGPT sets the logits of all invalid tokens (‘mechanism’ and ‘pathway’) to −∞ when computing log-probabilities for the target token ‘in’. Note that trie-based search is also applied during the validation phase of the fine-tuning stage for acceleration (an approximately 16× speed-up in our experiments).
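The following PyTorch sketch illustrates, under stated assumptions, how a learnable relative position bias B_{j−i} can be added to the scaled query–key scores and how a per-head scale γ_h can be applied before the output projection. It is a generic illustration of these two components, not the authors' implementation; tensor shapes, the bias table layout and the hyperparameters are assumptions.

```python
# Illustrative sketch (assumptions, not the BiomedGPT source) of self-attention with a
# learnable relative position bias B_{j-i} added to the scaled query-key scores, plus a
# per-head scale gamma_h applied before the output projection.
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, hidden_size: int = 768, num_heads: int = 12, max_len: int = 512):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # One bias per head and per relative offset (j - i); shared across layers in the paper.
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, 2 * max_len - 1))
        # Head-scale parameter gamma_h, applied before the output projection of each head.
        self.head_scale = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.max_len = max_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5       # (b, h, n, n)
        idx = torch.arange(n, device=x.device)
        offsets = idx[None, :] - idx[:, None]                         # relative position j - i
        scores = scores + self.rel_bias[:, offsets + self.max_len - 1]  # add B_{j-i}
        ctx = scores.softmax(dim=-1) @ v                              # (b, h, n, head_dim)
        ctx = self.head_scale * ctx                                   # gamma_h scaling
        return self.out(ctx.transpose(1, 2).reshape(b, n, -1))

print(RelPosSelfAttention()(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```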

Extended Data Fig. 4 Comparative performance of BiomedGPT and Med-PaLM M, and prompt-tuning results in image classification.

(a) Comparison between BiomedGPT-B and Med-PaLM M on the CBIS-DDSM dataset. (b) Experimental results of prompt tuning BiomedGPT-B on three image-classification datasets. Prompt tuning learns ‘soft prompts’ (extra model parameters) for each task instead of making a task-specific copy of the entire pretrained model for each downstream task; inference must be performed in separate batches. We note that the addition of soft prompts is contrary to the design principle of a generalist model. We injected two prompt layers into the encoder and decoder and varied the prompt length over {20, 40, 60, 80, 100, 120} to compare performance against full-model fine-tuning. The preliminary results for ‘Colon pathology’, ‘Blood cell microscope’ and ‘Chest X-ray’ were obtained after 100, 512 and 55 training epochs, respectively, all with a consistent batch size of 512. We observed that as the prompt length increases, model performance tends to improve. However, despite an increased number of tuning epochs compared with fine-tuning the original BiomedGPT (Fig. 3c), the performance after prompt tuning notably lags behind that of model fine-tuning. Specifically, considering only the best results in prompt tuning, there are substantial accuracy reductions of 32.3%, 54.6% and 32.6% on these three datasets, respectively.
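The sketch below illustrates the general soft-prompt idea described above: trainable prompt vectors are prepended to the inputs of a frozen backbone and only those vectors (plus a small head, here) receive gradients. It is not the paper's prompt-tuning implementation; the stand-in backbone, the prompt length of 60 and the class count are placeholders.

```python
# Minimal sketch (assumptions, not the paper's implementation) of soft-prompt tuning:
# trainable prompt vectors are prepended to a frozen backbone's input embeddings, and
# only the prompts and a small classification head are trained.
import torch
import torch.nn as nn

class SoftPromptClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size=768, prompt_len=60, num_classes=9):
        super().__init__()
        self.backbone = backbone                      # pretrained encoder, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden_size) * 0.02)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        b = input_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(b, -1, -1)       # (b, prompt_len, hidden)
        h = self.backbone(torch.cat([prompts, input_embeds], dim=1))
        return self.head(h.mean(dim=1))                            # pool and classify

# Usage with a stand-in backbone; in practice this would be the pretrained model.
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, batch_first=True), 2)
model = SoftPromptClassifier(backbone)
print(model(torch.randn(4, 32, 768)).shape)  # torch.Size([4, 9])
```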

Extended Data Fig. 5 Additional zero-shot results of BiomedGPT.

(a) Graphical illustration of zero-shot classification using CLIP-style models, linear-probing transfer learning using ViT- or BERT-style models, and zero-shot generation with BiomedGPT. Notably, our model can generate the response without additional components such as the label candidates required by CLIP or a linear classifier that must be trained for ViT. (b) Zero-shot performance on five disease-diagnosis tasks. (c) BiomedGPT shows competitive zero-shot performance compared with Med-PaLM M at a much smaller model scale. The SOTA fine-tuned model for TB detection is TBLightNet. Note that no single model consistently outperforms the others across all four metrics used in report generation; here, SOTA refers to the best performance achieved on each specific metric. We fine-tuned our pretrained BiomedGPT-B on MultiMedBench, which Med-PaLM M proposed and used for fine-tuning on top of the pretrained PaLM-E. We also attempted to fine-tune LLaVA-Med; however, the time and computational costs were prohibitive owing to the large scale of the model and data. We therefore report the results obtained with the pretrained checkpoint of LLaVA-Med.
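To make the interface difference in (a) concrete, the toy sketch below contrasts CLIP-style zero-shot classification, which needs an explicit list of label candidates to score, with open-ended generation, which simply decodes an answer. The functions, embeddings and labels here are stand-ins, not real checkpoints or the models evaluated in the paper.

```python
# Contrast sketch (stand-in functions and random features, not real checkpoints) of the
# two zero-shot interfaces: CLIP-style scoring over label candidates versus free-text
# generation that needs no candidate list or trained classifier head.
import torch
import torch.nn.functional as F

def clip_style_zero_shot(image_feat: torch.Tensor, text_feats: torch.Tensor, labels):
    # Cosine similarity between one image embedding and each candidate label embedding.
    sims = F.cosine_similarity(image_feat, text_feats)
    return labels[int(sims.argmax())]

def generative_zero_shot(generate_fn, image, question: str) -> str:
    # No label candidates or extra linear head: a seq2seq model decodes the answer directly.
    return generate_fn(image, f"Question: {question} Answer:")

labels = ["pneumonia", "no finding"]
image_feat = torch.randn(1, 512)          # placeholder image embedding
text_feats = torch.randn(len(labels), 512)  # placeholder label embeddings
print(clip_style_zero_shot(image_feat, text_feats, labels))

# generate_fn is a placeholder for any image-to-text model's decoding call.
print(generative_zero_shot(lambda img, prompt: "no finding", image=None,
                           question="Is pneumonia present?"))
```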

Extended Data Table 1 Fine-tuned experimental results of BiomedGPT on 25 diverse experiments
Extended Data Table 2 Datasets used in BiomedGPT for pretraining, fine-tuning, evaluation with details
Extended Data Table 3 Instructions for pretraining tasks along with the corresponding format of the output
Extended Data Table 4 Description of the question types for human evaluation
Extended Data Table 5 3D medical image classification performance

Supplementary information

Supplementary Information

Supplementary Figs. 1–9 and Supplementary Tables 1–7.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhang, K., Zhou, R., Adhikarla, E. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat Med (2024). https://doi.org/10.1038/s41591-024-03185-2

