
Transparent medical image AI via an image–text foundation model grounded in medical literature

Abstract

Building trustworthy and transparent image-based medical artificial intelligence (AI) systems requires the ability to interrogate data and models at all stages of the development pipeline, from training models to post-deployment monitoring. Ideally, the data and associated AI systems could be described using terms already familiar to physicians, but this requires medical datasets densely annotated with semantically meaningful concepts. Here we present a foundation model approach, named MONET (medical concept retriever), which learns to connect medical images with text and densely scores images on concept presence, enabling important tasks in medical AI development and deployment such as data auditing, model auditing and model interpretation. Dermatology provides a demanding use case for the versatility of MONET, owing to the heterogeneity in diseases, skin tones and imaging modalities. We trained MONET on 105,550 dermatological images paired with natural-language descriptions from a large collection of medical literature. MONET accurately annotates concepts across dermatology images, as verified by board-certified dermatologists, performing competitively with supervised models trained on previously concept-annotated dermatology datasets of clinical images. We demonstrate how MONET enables AI transparency across the entire AI system development pipeline, from building inherently interpretable models to dataset and model auditing, including a case study dissecting the results of an AI clinical trial.
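To make the underlying mechanism concrete, the sketch below illustrates CLIP-style concept scoring, the idea on which MONET's concept presence scores are built: embed an image and a set of concept prompts with a contrastive image-text model and rank the concepts by image-text similarity. This is a minimal illustration using a generic public CLIP checkpoint rather than the released MONET weights (see Code availability); the concept prompts and file name are placeholders.

```python
# Minimal sketch of CLIP-style concept scoring (generic public checkpoint,
# not the MONET weights; concepts and file name are illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["erythema", "ulceration", "hair", "purple pen marking"]
prompts = [f"a dermatology image showing {c}" for c in concepts]

image = Image.open("lesion.jpg")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_image contains scaled image-text cosine similarities; here they
# act as relative concept presence scores for the image.
scores = out.logits_per_image.squeeze(0)
for concept, score in sorted(zip(concepts, scores.tolist()), key=lambda t: -t[1]):
    print(f"{concept}: {score:.2f}")
```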


Fig. 1: Overview of the MONET framework and its usage examples.
Fig. 2: Images with high concept presence scores calculated using MONET.
Fig. 3: Concept-level data auditing.
Fig. 4: Concept-level model auditing.
Fig. 5: Concept bottleneck model.

Data availability

The PMC Open Access Subset is publicly available from https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist. Evaluation datasets are all publicly available and can be accessed from: ISIC (https://challenge.isic-archive.com/data), Derm7pt (https://derm.cs.sfu.ca), Fitzpatrick 17k (https://github.com/mattgroh/fitzpatrick17k) and DDI (https://stanfordaimi.azurewebsites.net/datasets/35866158-8196-48d8-87bf-50dca81df965).

Code availability

The code used in our analysis is available at https://github.com/suinleelab/MONET (ref. 84). It includes scripts for data collection and preprocessing, for training the MONET model and for conducting benchmark studies, and it also provides the MONET model weights. The ADAE algorithm is publicly available at https://github.com/ISIC-Research/ADAE.

References

  1. Daneshjou, R., Yuksekgonul, M., Cai, Z. R., Novoa, R. & Zou, J. Y. SkinCon: a skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 18157–18167 (Curran Associates, Inc., 2022).

  2. Mendonça, T., Ferreira, P. M., Marques, J. S., Marcal, A. R. & Rozeira, J. PH²: a dermoscopic image database for research and benchmarking. In 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 5437–5440 (IEEE, 2013).

  3. Kawahara, J., Daneshvar, S., Argenziano, G. & Hamarneh, G. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE J. Biomed. Health Inform. 23, 538–546 (2019).

  4. Nevitt, M., Felson, D. & Lester, G. The Osteoarthritis Initiative. Protocol for the cohort study V 1.1 6.21.06 (accessed 1 Nov 2023); https://nda.nih.gov/static/docs/StudyDesignProtocolAndAppendices.pdf

  5. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  6. Groh, M. et al. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 1820–1828 (IEEE, 2021).

  7. Daneshjou, R. et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).

  8. Gutman, D. et al. Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). Preprint at https://arxiv.org/abs/1605.01397 (2016).

  9. Codella, N. C. F. et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) 168–172 (IEEE, 2018).

  10. Codella, N. et al. Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). Preprint at https://arxiv.org/abs/1902.03368 (2019).

  11. Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data https://doi.org/10.1038/sdata.2018.161 (2018).

  12. Combalia, M. et al. BCN20000: dermoscopic lesions in the wild. Preprint at https://arxiv.org/abs/1908.02288 (2019).

  13. Rotemberg, V. et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data https://doi.org/10.1038/s41597-021-00815-z (2021).

  14. Memorial Sloan Kettering Cancer Center. Consecutive biopsies for melanoma across year 2020. ISIC Archive https://doi.org/10.34970/151324 (2022).

  15. Marchetti, M. A. et al. Prospective validation of dermoscopy-based open-source artificial intelligence for melanoma diagnosis (PROVE-AI study). npj Digit. Med. 6, 127 (2023).

  16. Ricci Lara, M. A. et al. A dataset of skin lesion images collected in Argentina for the evaluation of AI tools in this population. Sci. Data 10, 712 (2023).

  17. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

  18. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-022-00936-9 (2022).

  19. Agbai, O. N. et al. Skin cancer and photoprotection in people of color: a review and recommendations for physicians and the public. J. Am. Acad. Dermatol. 70, 748–762 (2014).

  20. Sierro, T. J. et al. Differences in health care resource utilization and costs for keratinocyte carcinoma among racioethnic groups: a population-based study. J. Am. Acad. Dermatol. 86, 373–378 (2022).

  21. DeGrave, A. J., Janizek, J. D. & Lee, S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619 (2021).

  22. Janizek, J. D., Erion, G., DeGrave, A. J. & Lee, S.-I. An adversarial approach for the robust classification of pneumonia from chest radiographs. In Proc. ACM Conference on Health, Inference, and Learning (ed. Ghassemi, M.) 69–79 (Association for Computing Machinery, 2020).

  23. Bissoto, A., Fornaciali, M., Valle, E. & Avila, S. (De)constructing bias on skin lesion datasets. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2766–2774 (IEEE, 2019).

  24. Cassidy, B., Kendrick, C., Brodzicki, A., Jaworek-Korjakowska, J. & Yap, M. H. Analysis of the ISIC image datasets: usage, benchmarks and recommendations. Med. Image Anal. 75, 102305 (2022).

  25. Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135–1141 (2019).

  26. Navarrete-Dechent, C., Liopyris, K. & Marchetti, M. A. Multiclass artificial intelligence in dermatology: progress but still room for improvement. J. Invest. Dermatol. 141, 1325–1328 (2021).

  27. Daneshjou, R., Smith, M. P., Sun, M. D., Rotemberg, V. & Zou, J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: a scoping review. JAMA Dermatol. 157, 1362–1369 (2021).

  28. Massie, J. P. et al. Patient representation in medical literature: are we appropriately depicting diversity? Plast. Reconstr. Surg. Glob. Open 7, e2563 (2019).

  29. Massie, J. P. et al. A picture of modern medicine: race and visual representation in medical literature. J. Natl Med. Assoc. 113, 88–94 (2021).

  30. Lester, J., Jia, J., Zhang, L., Okoye, G. & Linos, E. Absence of images of skin of colour in publications of COVID-19 skin manifestations. Br. J. Dermatol. 183, 593–595 (2020).

  31. Louie, P. & Wilkes, R. Representations of race and skin tone in medical textbook imagery. Soc. Sci. Med. 202, 38–42 (2018).

  32. Groh, M., Harris, C., Daneshjou, R., Badri, O. & Koochek, A. Towards transparency in dermatology image datasets with skin tone annotations by experts, crowds, and an algorithm. In Proc. ACM on Human-Computer Interaction 1–26 (Association for Computing Machinery, 2022).

  33. Rajpurkar, P. et al. MURA: large dataset for abnormality detection in musculoskeletal radiographs. In 1st Conference on Medical Imaging with Deep Learning (MIDL, 2018).

  34. Singh, C., Balakrishnan, G. & Perona, P. Matched sample selection with GANs for mitigating attribute confounding. Preprint at https://arxiv.org/abs/2103.13455 (2021).

  35. Leming, M., Das, S. & Im, H. Construction of a confounder-free clinical MRI dataset in the Mass General Brigham system for classification of Alzheimer’s disease. Artif. Intell. Med. 129, 102309 (2022).

  36. Zhao, Q., Adeli, E. & Pohl, K. M. Training confounder-free deep learning models for medical applications. Nat. Commun. 11, 6010 (2020).

  37. Goel, K., Gu, A., Li, Y. & Ré, C. Model patching: closing the subgroup performance gap with data augmentation. In 9th International Conference on Learning Representations (ICLR, 2021); https://openreview.net/forum?id=9YlaeLfuhJF

  38. Sagawa, S., Koh, P. W., Hashimoto, T. B. & Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations (ICLR, 2020); https://dblp.org/rec/conf/iclr/SagawaKHL20.html?view=bibtex

  39. Oakden-Rayner, L., Dunnmon, J., Carneiro, G. & Re, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proc. ACM Conference on Health, Inference, and Learning (CHIL ’20) (ed. Ghassemi, M.) 151–159 (ACM, 2020).

  40. Zhu, J., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017 2242–2251 (IEEE Computer Society, 2017).

  41. Jones, O. T. et al. Artificial intelligence and machine learning algorithms for early detection of skin cancer in community and primary care settings: a systematic review. Lancet Digit. Health 4, e466–e476 (2022).

  42. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 4768–4777 (Curran Associates Inc., 2017).

  43. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).

  44. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV) 618–626 (IEEE, 2017).

  45. Kim, B. et al. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In Proc. 35th International Conference on Machine Learning (eds Dy, J. G. & Krause, A.) 2668–2677 (PMLR, 2018).

  46. Crabbé, J. & van der Schaar, M. Concept activation regions: a generalized framework for concept-based explanations. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 2590–2607 (Curran Associates Inc., 2022).

  47. Abid, A., Yuksekgonul, M. & Zou, J. Meaningfully debugging model mistakes using conceptual counterfactual explanations. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 66–68 (PMLR, 2022).

  48. Eyuboglu, S. et al. Domino: discovering systematic errors with cross-modal embeddings. In 10th International Conference on Learning Representations, ICLR 2022 (ICLR, 2022); https://openreview.net/forum?id=FPCMqjI0jXN

  49. Chung, Y., Kraska, T., Polyzotis, N., Tae, K. & Whang, S. Automated data slicing for model validation: a big data–AI integration approach. IEEE Trans. Knowl. Data Eng. 2284–2296 (2020).

  50. DeGrave, A. J., Cai, Z. R., Janizek, J. D., Daneshjou, R. & Lee, S.-I. Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-023-01160-9 (2023).

  51. Reyes, M. et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol. Artif. Intell. 2, e190043 (2020).

  52. Arun, N. et al. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiol. Artif. Intell. 3, e200267 (2021).

  53. Han, S. S. et al. The degradation of performance of a state-of-the-art skin image classifier when applied to patient-driven internet search. Sci. Rep. 12, 16260 (2022).

  54. Navarrete-Dechent, C. et al. Automated dermatological diagnosis: hype or reality? J. Invest. Dermatol. 138, 2277–2279 (2018).

  55. Koh, P. W. et al. Concept bottleneck models. In Proc. 37th International Conference on Machine Learning 5338–5348 (PMLR, 2020).

  56. Yuksekgonul, M., Wang, M. & Zou, J. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations (ICLR, 2023); https://dblp.org/rec/conf/iclr/YuksekgonulW023.html?view=bibtex

  57. Rigel, D. S., Friedman, R. J., Kopf, A. W. & Polsky, D. ABCDE—an evolving concept in the early detection of melanoma. Arch. Dermatol. 141, 1032–1034 (2005).

  58. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).

  59. Wang, Z., Wu, Z., Agarwal, D. & Sun, J. MedCLIP: contrastive learning from unpaired medical images and text. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 3876–3887 (Association for Computational Linguistics, 2022).

  60. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  61. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

  62. Combalia, M. et al. Validation of artificial intelligence prediction models for skin cancer diagnosis using dermoscopy images: the 2019 International Skin Imaging Collaboration Grand Challenge. Lancet Digit. Health 4, e330–e339 (2022).

  63. Corbin, C. K. et al. DEPLOYR: a technical framework for deploying custom real-time machine learning models into the electronic medical record. J. Am. Med. Inform. Assoc. 30, 1532–1542 (2023).

  64. Coalition for Health AI. Blueprint for Trustworthy AI Implementation Guidance and Assurance for Healthcare (MITRE Corporation, 2023); https://tinyurl.com/CHAI-paper

  65. Bedoya, A. D. et al. A framework for the oversight and local deployment of safe and high-quality prediction models. J. Am. Med. Inform. Assoc. 29, 1631–1636 (2022).

  66. Pianykh, O. S. et al. Continuous learning AI in radiology: implementation principles and early applications. Radiology 297, 6–14 (2020).

  67. Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 5, 66 (2022).

  68. Vokinger, K. N., Feuerriegel, S. & Kesselheim, A. S. Continual learning in medical devices: FDA’s action plan and beyond. Lancet Digit. Health 3, e337–e338 (2021).

  69. PMC Open Access Subset. National Library of Medicine www.ncbi.nlm.nih.gov/pmc/tools/openftlist (2022).

  70. Gamper, J. & Rajpoot, N. M. Multiple instance captioning: learning representations from histopathology textbooks and articles. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021 16549–16559 (Computer Vision Foundation/IEEE, 2021).

  71. Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (IEEE Computer Society, 2017).

  72. Tan, M. & Le, Q. V. EfficientNetV2: smaller models and faster training. In Proc. 38th International Conference on Machine Learning, ICML 2021 (eds Meila, M. & Zhang, T.) 10096–10106 (PMLR, 2021).

  73. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021 (ICLR, 2021); https://openreview.net/forum?id=YicbFdNTTy

  74. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009) 248–255 (IEEE Computer Society, 2009).

  75. Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. In Proc. 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers 1715–1725 (Association for Computational Linguistics, 2016).

  76. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015 (eds Bengio, Y. & LeCun, Y.) (ICLR, 2015).

  77. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  78. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982).

  79. Lanchantin, J., Wang, T., Ordonez, V. & Qi, Y. General multi-label image classification with transformers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16478–16488 (IEEE, 2021).

  80. Jeyakumar, J. V. et al. Automatic concept extraction for concept bottleneck-based video classification. Preprint at https://arxiv.org/abs/2206.10129 (2022).

  81. Sun, X. et al. Interpreting deep learning models in natural language processing: a review. Preprint at https://arxiv.org/abs/2110.10470 (2021).

  82. Klimiene, U. et al. Multiview concept bottleneck models applied to diagnosing pediatric appendicitis. In 2nd Workshop on Interpretable Machine Learning in Healthcare (IMLH, 2022).

  83. Wu, C., Parbhoo, S., Havasi, M. & Doshi-Velez, F. Learning optimal summaries of clinical time-series with concept bottleneck models. In Proc. 7th Machine Learning for Healthcare Conference (eds Lipton, Z. C. et al.) 648–672 (PMLR, 2022).

  84. suinleelab/MONET. GitHub https://github.com/suinleelab/MONET (2024).

Acknowledgements

We thank C. Lin and other members of S.-I.L.’s lab for helpful discussions. The members of S.-I.L.’s lab, including C.K., S.U.G., A.J.D. and S.-I.L., received support from the National Science Foundation (grant nos. CAREER DBI-1552309 and DBI-1759487) and the National Institutes of Health (NIH, grant nos. R35 GM 128638 and R01 AG061132). R.D. was supported by the NIH (5T32 AR007422-38) and the Stanford Catalyst Program.

Author information

Authors and Affiliations

Authors

Contributions

C.K., R.D. and S.-I.L. conceived the initial study. C.K., S.U.G., A.J.D. and J.A.O. performed the experiments. Z.R.C. and R.D. evaluated the training data, provided dermatological insights and clinical context in all steps of the analyses, and analyzed images from concept retrieval experiments. C.K., S.U.G., A.J.D., J.A.O., Z.R.C., R.D. and S.-I.L. wrote the paper. S.-I.L. secured funding. R.D. and S.-I.L. co-supervised the study.

Corresponding authors

Correspondence to Roxana Daneshjou or Su-In Lee.

Ethics declarations

Competing interests

R.D. reports consulting fees from L’Oreal, Frazier Healthcare Partners, Pfizer, DWA and VisualDx; stock options from MDAcne and Revea for advisory board service; and research funding from UCB. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Titus Brinker and Ben Glocker for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Dermoscopic images with artifacts from the ISIC dataset as determined by high concept presence scores calculated using MONET.

We show the top 30 images for each artifact. a, Purple pen. b, Orange sticker. c, Nail. d, Hair. e, Dermoscopic border. Figures adapted from refs. 11 and 12 with permission and from refs. 13 and 14 under a CC-BY license.
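As a sketch of how such a ranking can be produced, the snippet below ranks a dataset by concept presence score and returns the top hits. It assumes image embeddings precomputed with a CLIP-style image encoder; `encode_text` and the variable names are hypothetical placeholders, not the released MONET API.

```python
# Hypothetical sketch of concept-based retrieval for data auditing: rank
# images by cosine similarity to a concept prompt and inspect the top k.
import numpy as np

def top_k_by_concept(image_embeddings: np.ndarray,
                     concept_embedding: np.ndarray,
                     k: int = 30) -> np.ndarray:
    """Indices of the k images most similar to the concept embedding."""
    # Normalize so that dot products equal cosine similarities.
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = concept_embedding / np.linalg.norm(concept_embedding)
    scores = imgs @ txt
    return np.argsort(scores)[::-1][:k]

# Usage (placeholders): flag likely "purple pen" artifacts in a dataset.
# embeddings = ...                       # (n_images, d) image embeddings
# idx = top_k_by_concept(embeddings, encode_text("purple pen marking"))
```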

Extended Data Fig. 2 Concept bottleneck models in the interinstitution model transfer setting.

a–d, We train models using images from Hospital Barcelona and test their performance using images from Med. U. Vienna. In each setting, we repeated the evaluations 20 times, using a different split of images from the training site into train and validation sets each time. a,b, Performance comparison of malignancy and melanoma prediction models. We measure the validation AUROC on a 25% validation set from Hospital Barcelona data. We measure the test AUROC on the entire Med. U. Vienna data. c,d, Coefficients of the linear model in MONET + CBM for malignancy and melanoma predictions. The error bars, obtained from n = 20 different runs, indicate the 95% confidence interval, extending from the mean. e–h, We train models using images from Med. U. Vienna and test their performance using images from Hospital Barcelona. e,f, Performance comparison of malignancy and melanoma prediction models. We measure the validation AUROC on a 25% validation set from Med. U. Vienna data. We measure the test AUROC on the entire Hospital Barcelona data. g,h, Coefficients of the linear model in MONET + CBM for malignancy and melanoma predictions. The error bars, obtained from n = 20 different runs, indicate the 95% confidence interval, extending from the mean.
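As a minimal sketch of the MONET + CBM recipe evaluated in these panels, assuming concept presence scores have already been computed: a linear (logistic regression) classifier is fit on the scores, so each coefficient attaches to a named concept. The scikit-learn calls are real; the data below are random placeholders standing in for MONET scores and malignancy labels.

```python
# Sketch of a concept bottleneck model: a linear classifier over concept
# presence scores, evaluated by AUROC, with concept-level coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
concept_scores = rng.random((500, 10))  # placeholder: 10 curated concepts
labels = rng.integers(0, 2, size=500)   # placeholder: malignant vs. benign

X_train, X_val, y_train, y_val = train_test_split(
    concept_scores, labels, test_size=0.25, random_state=0)

cbm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation AUROC:",
      roc_auc_score(y_val, cbm.predict_proba(X_val)[:, 1]))

# Each coefficient belongs to a named concept, giving the concept-level
# explanations plotted in panels c,d and g,h.
print("per-concept coefficients:", cbm.coef_.ravel())
```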

Extended Data Fig. 3 Effect of concept sets on MONET + CBM’s performance.

We compare CBMs operating on our curated, task-relevant concepts with those operating on SkinCon concepts. From each concept list, we sample subsets of concepts of varying sizes and train CBMs separately on these subsets, repeating this process 20 times for each subset size. For our 10 curated concepts, we use subset sizes ranging from 1 to 10; for SkinCon concepts, we use subset sizes in steps of 5, plus 1 and the full set of 48 (that is, 1, 5, 10, …, 40, 45, 48). The center line indicates the mean across n = 20 runs, with the shaded area covering the 95% confidence interval. a, Performance of MONET + CBM for malignancy and melanoma predictions on clinical images with respect to the number of concepts. b, Performance of MONET + CBM for malignancy and melanoma predictions on dermoscopic images with respect to the number of concepts. c, Performance of MONET + CBM for malignancy and melanoma predictions on dermoscopic images in an interinstitution model transfer setting with respect to the number of concepts. We train models on Hospital Barcelona and test them on Med. U. Vienna. d, Performance of MONET + CBM for malignancy and melanoma predictions on dermoscopic images in an interinstitution model transfer setting with respect to the number of concepts. We train models on Med. U. Vienna and test them on Hospital Barcelona.
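The ablation reduces to the loop sketched below (placeholder data again): for each subset size, sample 20 concept subsets, retrain the linear CBM on each, and summarize the validation AUROC by its mean and an approximate 95% confidence interval.

```python
# Sketch of the concept-subset ablation: resample concept subsets of each
# size, retrain the linear CBM, and summarize AUROC across 20 runs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def subset_auroc(X, y, concept_idx, seed):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X[:, concept_idx], y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

rng = np.random.default_rng(0)
X = rng.random((500, 48))            # placeholder: 48 SkinCon concept scores
y = rng.integers(0, 2, size=500)     # placeholder labels

for size in [1, 5, 10, 20, 48]:
    aurocs = [subset_auroc(X, y, rng.choice(48, size=size, replace=False), s)
              for s in range(20)]
    mean = np.mean(aurocs)
    ci = 1.96 * np.std(aurocs, ddof=1) / np.sqrt(len(aurocs))  # ~95% CI of mean
    print(f"{size:2d} concepts: AUROC {mean:.3f} +/- {ci:.3f}")
```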

Supplementary information

Supplementary Information

Supplementary Methods, Discussion, Figs. 1–10 and Tables 1–8.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, C., Gadgil, S.U., DeGrave, A.J. et al. Transparent medical image AI via an image–text foundation model grounded in medical literature. Nat Med 30, 1154–1165 (2024). https://doi.org/10.1038/s41591-024-02887-x

