
Transparent medical image AI via an image–text foundation model grounded in medical literature

Abstract

Building trustworthy and transparent image-based medical artificial intelligence (AI) systems requires the ability to interrogate data and models at all stages of the development pipeline, from training models to post-deployment monitoring. Ideally, the data and associated AI systems could be described using terms already familiar to physicians, but this requires medical datasets densely annotated with semantically meaningful concepts. Here we present a foundation model approach, named MONET (medical concept retriever), which learns to connect medical images with text and densely scores images on concept presence, enabling important tasks in medical AI development and deployment such as data auditing, model auditing and model interpretation. Dermatology provides a demanding use case for the versatility of MONET, owing to the heterogeneity in diseases, skin tones and imaging modalities. We trained MONET on 105,550 dermatological images paired with natural-language descriptions from a large collection of medical literature. MONET accurately annotates concepts across dermatology images, as verified by board-certified dermatologists, performing competitively with supervised models trained on previously concept-annotated dermatology datasets of clinical images. We demonstrate how MONET enables AI transparency across the entire AI system development pipeline, from building inherently interpretable models to dataset and model auditing, including a case study dissecting the results of an AI clinical trial.
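To make the underlying mechanism concrete, the sketch below illustrates CLIP-style concept scoring, the idea on which MONET's concept presence scores are built: embed an image and a set of concept prompts with a contrastive image-text model and rank the concepts by image-text similarity. This is a minimal illustration using a generic public CLIP checkpoint rather than the released MONET weights (see Code availability); the concept prompts and file name are placeholders.

```python
# Minimal sketch of CLIP-style concept scoring (generic public checkpoint,
# not the MONET weights; concepts and file name are illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["erythema", "ulceration", "hair", "purple pen marking"]
prompts = [f"a dermatology image showing {c}" for c in concepts]

image = Image.open("lesion.jpg")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_image contains scaled image-text cosine similarities; here they
# act as relative concept presence scores for the image.
scores = out.logits_per_image.squeeze(0)
for concept, score in sorted(zip(concepts, scores.tolist()), key=lambda t: -t[1]):
    print(f"{concept}: {score:.2f}")
```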


Fig. 1: Overview of the MONET framework and its usage examples.
Fig. 2: Images with high concept presence scores calculated using MONET.
Fig. 3: Concept-level data auditing.
Fig. 4: Concept-level model auditing.
Fig. 5: Concept bottleneck model.

Data availability

The PMC Open Access Subset is publicly available from https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist. Evaluation datasets are all publicly available and can be accessed from: ISIC (https://challenge.isic-archive.com/data), Derm7pt (https://derm.cs.sfu.ca), Fitzpatrick 17k (https://github.com/mattgroh/fitzpatrick17k) and DDI (https://stanfordaimi.azurewebsites.net/datasets/35866158-8196-48d8-87bf-50dca81df965).

Code availability

The code used in our analysis is available at https://github.com/suinleelab/MONET (ref. 84). It includes scripts for data collection and preprocessing, for training the MONET model and for conducting benchmark studies, and it also provides the MONET model weights. The ADAE algorithm is publicly available at https://github.com/ISIC-Research/ADAE.

References

  1. Daneshjou, R., Yuksekgonul, M., Cai, Z. R., Novoa, R. & Zou, J. Y. SkinCon: a skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 18157–18167 (Curran Associates, Inc., 2022).

  2. Mendonça, T., Ferreira, P. M., Marques, J. S., Marcal, A. R. & Rozeira, J. PH²: a dermoscopic image database for research and benchmarking. In 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 5437–5440 (IEEE, 2013).

  3. Kawahara, J., Daneshvar, S., Argenziano, G. & Hamarneh, G. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE J. Biomed. Health Inform. 23, 538–546 (2019).

  4. Nevitt, M., Felson, D. & Lester, G. The Osteoarthritis Initiative. Protocol for the cohort study V 1.1 6.21.06 (accessed 1 Nov 2023); https://nda.nih.gov/static/docs/StudyDesignProtocolAndAppendices.pdf

  5. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  6. Groh, M. et al. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 1820–1828 (IEEE, 2021).

  7. Daneshjou, R. et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).

  8. Gutman, D. et al. Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). Preprint at https://arxiv.org/abs/1605.01397 (2016).

  9. Codella, N. C. F. et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) 168–172 (IEEE, 2018).

  10. Codella, N. et al. Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). Preprint at https://arxiv.org/abs/1902.03368 (2019).

  11. Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data https://doi.org/10.1038/sdata.2018.161 (2018).

  12. Combalia, M. et al. BCN20000: dermoscopic lesions in the wild. Preprint at https://arxiv.org/abs/1908.02288 (2019).

  13. Rotemberg, V. et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data https://doi.org/10.1038/s41597-021-00815-z (2021).

  14. Memorial Sloan Kettering Cancer Center. Consecutive biopsies for melanoma across year 2020. ISIC Archive https://doi.org/10.34970/151324 (2022).

  15. Marchetti, M. A. et al. Prospective validation of dermoscopy-based open-source artificial intelligence for melanoma diagnosis (PROVE-AI study). npj Digit. Med. 6, 127 (2023).

  16. Ricci Lara, M. A. et al. A dataset of skin lesion images collected in Argentina for the evaluation of AI tools in this population. Sci. Data 10, 712 (2023).

  17. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

  18. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-022-00936-9 (2022).

  19. Agbai, O. N. et al. Skin cancer and photoprotection in people of color: a review and recommendations for physicians and the public. J. Am. Acad. Dermatol. 70, 748–762 (2014).

  20. Sierro, T. J. et al. Differences in health care resource utilization and costs for keratinocyte carcinoma among racioethnic groups: a population-based study. J. Am. Acad. Dermatol. 86, 373–378 (2022).

  21. DeGrave, A. J., Janizek, J. D. & Lee, S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619 (2021).

  22. Janizek, J. D., Erion, G., DeGrave, A. J. & Lee, S.-I. An adversarial approach for the robust classification of pneumonia from chest radiographs. In Proc. ACM Conference on Health, Inference, and Learning (ed. Ghassemi, M.) 69–79 (Association for Computing Machinery, 2020).

  23. Bissoto, A., Fornaciali, M., Valle, E. & Avila, S. (De)constructing bias on skin lesion datasets. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2766–2774 (IEEE, 2019).

  24. Cassidy, B., Kendrick, C., Brodzicki, A., Jaworek-Korjakowska, J. & Yap, M. H. Analysis of the ISIC image datasets: usage, benchmarks and recommendations. Med. Image Anal. 75, 102305 (2022).

  25. Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135–1141 (2019).

  26. Navarrete-Dechent, C., Liopyris, K. & Marchetti, M. A. Multiclass artificial intelligence in dermatology: progress but still room for improvement. J. Invest. Dermatol. 141, 1325–1328 (2021).

  27. Daneshjou, R., Smith, M. P., Sun, M. D., Rotemberg, V. & Zou, J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: a scoping review. JAMA Dermatol. 157, 1362–1369 (2021).

  28. Massie, J. P. et al. Patient representation in medical literature: are we appropriately depicting diversity? Plast. Reconstr. Surg. Glob. Open 7, e2563 (2019).

  29. Massie, J. P. et al. A picture of modern medicine: race and visual representation in medical literature. J. Natl Med. Assoc. 113, 88–94 (2021).

  30. Lester, J., Jia, J., Zhang, L., Okoye, G. & Linos, E. Absence of images of skin of colour in publications of COVID-19 skin manifestations. Br. J. Dermatol. 183, 593–595 (2020).

  31. Louie, P. & Wilkes, R. Representations of race and skin tone in medical textbook imagery. Soc. Sci. Med. 202, 38–42 (2018).

  32. Groh, M., Harris, C., Daneshjou, R., Badri, O. & Koochek, A. Towards transparency in dermatology image datasets with skin tone annotations by experts, crowds, and an algorithm. In Proc. ACM on Human-Computer Interaction 1–26 (Association for Computing Machinery, 2022).

  33. Rajpurkar, P. et al. MURA: large dataset for abnormality detection in musculoskeletal radiographs. In 1st Conference on Medical Imaging with Deep Learning (MIDL, 2018).

  34. Singh, C., Balakrishnan, G. & Perona, P. Matched sample selection with GANs for mitigating attribute confounding. Preprint at https://arxiv.org/abs/2103.13455 (2021).

  35. Leming, M., Das, S. & Im, H. Construction of a confounder-free clinical MRI dataset in the Mass General Brigham system for classification of Alzheimer’s disease. Artif. Intell. Med. 129, 102309 (2022).

  36. Zhao, Q., Adeli, E. & Pohl, K. M. Training confounder-free deep learning models for medical applications. Nat. Commun. 11, 6010 (2020).

  37. Goel, K., Gu, A., Li, Y. & Ré, C. Model patching: closing the subgroup performance gap with data augmentation. In 9th International Conference on Learning Representations (ICLR, 2021); https://openreview.net/forum?id=9YlaeLfuhJF

  38. Sagawa, S., Koh, P. W., Hashimoto, T. B. & Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations (ICLR, 2020); https://dblp.org/rec/conf/iclr/SagawaKHL20.html?view=bibtex

  39. Oakden-Rayner, L., Dunnmon, J., Carneiro, G. & Re, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proc. ACM Conference on Health, Inference, and Learning (CHIL ’20) (ed. Ghassemi, M.) 151–159 (ACM, 2020).

  40. Zhu, J., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017 2242–2251 (IEEE Computer Society, 2017).

  41. Jones, O. T. et al. Artificial intelligence and machine learning algorithms for early detection of skin cancer in community and primary care settings: a systematic review. Lancet Digit. Health 4, e466–e476 (2022).

  42. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 4768–4777 (Curran Associates Inc., 2017).

  43. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).

  44. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV) 618–626 (IEEE, 2017).

  45. Kim, B. et al. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In Proc. 35th International Conference on Machine Learning (eds Dy, J. G. & Krause, A.) 2668–2677 (PMLR, 2018).

  46. Crabbé, J. & van der Schaar, M. Concept activation regions: a generalized framework for concept-based explanations. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 2590–2607 (Curran Associates Inc., 2022).

  47. Abid, A., Yuksekgonul, M. & Zou, J. Meaningfully debugging model mistakes using conceptual counterfactual explanations. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 66–68 (PMLR, 2022).

  48. Eyuboglu, S. et al. Domino: discovering systematic errors with cross-modal embeddings. In 10th International Conference on Learning Representations, ICLR 2022 (ICLR, 2022); https://openreview.net/forum?id=FPCMqjI0jXN

  49. Chung, Y., Kraska, T., Polyzotis, N., Tae, K. & Whang, S. Automated data slicing for model validation: a big data–AI integration approach. IEEE Trans. Knowl. Data Eng. 2284–2296 (2020).

  50. DeGrave, A. J., Cai, Z. R., Janizek, J. D., Daneshjou, R. & Lee, S.-I. Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-023-01160-9 (2023).

  51. Reyes, M. et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol. Artif. Intell. 2, e190043 (2020).

  52. Arun, N. et al. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiol. Artif. Intell. 3, e200267 (2021).

  53. Han, S. S. et al. The degradation of performance of a state-of-the-art skin image classifier when applied to patient-driven internet search. Sci. Rep. 12, 16260 (2022).

  54. Navarrete-Dechent, C. et al. Automated dermatological diagnosis: hype or reality? J. Invest. Dermatol. 138, 2277–2279 (2018).

  55. Koh, P. W. et al. Concept bottleneck models. In Proc. 37th International Conference on Machine Learning 5338–5348 (PMLR, 2020).

  56. Yuksekgonul, M., Wang, M. & Zou, J. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations (ICLR, 2023); https://dblp.org/rec/conf/iclr/YuksekgonulW023.html?view=bibtex

  57. Rigel, D. S., Friedman, R. J., Kopf, A. W. & Polsky, D. ABCDE—an evolving concept in the early detection of melanoma. Arch. Dermatol. 141, 1032–1034 (2005).

  58. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).

  59. Wang, Z., Wu, Z., Agarwal, D. & Sun, J. MedCLIP: contrastive learning from unpaired medical images and text. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 3876–3887 (Association for Computational Linguistics, 2022).

  60. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  61. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

  62. Combalia, M. et al. Validation of artificial intelligence prediction models for skin cancer diagnosis using dermoscopy images: the 2019 International Skin Imaging Collaboration Grand Challenge. Lancet Digit. Health 4, e330–e339 (2022).

  63. Corbin, C. K. et al. DEPLOYR: a technical framework for deploying custom real-time machine learning models into the electronic medical record. J. Am. Med. Inform. Assoc. 30, 1532–1542 (2023).

  64. Coalition for Health AI. Blueprint for Trustworthy AI Implementation Guidance and Assurance for Healthcare (MITRE Corporation, 2023); https://tinyurl.com/CHAI-paper

  65. Bedoya, A. D. et al. A framework for the oversight and local deployment of safe and high-quality prediction models. J. Am. Med. Inform. Assoc. 29, 1631–1636 (2022).

  66. Pianykh, O. S. et al. Continuous learning AI in radiology: implementation principles and early applications. Radiology 297, 6–14 (2020).

  67. Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 5, 66 (2022).

  68. Vokinger, K. N., Feuerriegel, S. & Kesselheim, A. S. Continual learning in medical devices: FDA’s action plan and beyond. Lancet Digit. Health 3, e337–e338 (2021).

  69. PMC Open Access Subset. National Library of Medicine www.ncbi.nlm.nih.gov/pmc/tools/openftlist (2022).

  70. Gamper, J. & Rajpoot, N. M. Multiple instance captioning: learning representations from histopathology textbooks and articles. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021 16549–16559 (Computer Vision Foundation/IEEE, 2021).

  71. Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (IEEE Computer Society, 2017).

  72. Tan, M. & Le, Q. V. EfficientNetV2: smaller models and faster training. In Proc. 38th International Conference on Machine Learning, ICML 2021 (eds Meila, M. & Zhang, T.) 10096–10106 (PMLR, 2021).

  73. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021 (ICLR, 2021); https://openreview.net/forum?id=YicbFdNTTy

  74. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009) 248–255 (IEEE Computer Society, 2009).

  75. Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. In Proc. 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers 1715–1725 (Association for Computational Linguistics, 2016).

  76. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015 (eds Bengio, Y. & LeCun, Y.) (ICLR, 2015).

  77. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  78. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982).

  79. Lanchantin, J., Wang, T., Ordonez, V. & Qi, Y. General multi-label image classification with transformers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16478–16488 (IEEE, 2021).

  80. Jeyakumar, J. V. et al. Automatic concept extraction for concept bottleneck-based video classification. Preprint at https://arxiv.org/abs/2206.10129 (2022).

  81. Sun, X. et al. Interpreting deep learning models in natural language processing: a review. Preprint at https://arxiv.org/abs/2110.10470 (2021).

  82. Klimiene, U. et al. Multiview concept bottleneck models applied to diagnosing pediatric appendicitis. In 2nd Workshop on Interpretable Machine Learning in Healthcare (IMLH, 2022).

  83. Wu, C., Parbhoo, S., Havasi, M. & Doshi-Velez, F. Learning optimal summaries of clinical time-series with concept bottleneck models. In Proc. 7th Machine Learning for Healthcare Conference (eds Lipton, Z. C. et al.) 648–672 (PMLR, 2022).

  84. suinleelab/MONET. GitHub https://github.com/suinleelab/MONET (2024).

Acknowledgements

We thank C. Lin and other members of S.-I.L.’s lab for helpful discussions. The members of S.-I.L.’s lab, including C.K., S.U.G., A.J.D. and S.-I.L., received support from the National Science Foundation (grant nos. CAREER DBI-1552309 and DBI-1759487) and the National Institutes of Health (NIH, grant nos. R35 GM 128638 and R01 AG061132). R.D. was supported by the NIH (5T32 AR007422-38) and the Stanford Catalyst Program.

Author information

Authors and Affiliations

Authors

Contributions

C.K., R.D. and S.-I.L. conceived the initial study. C.K., S.U.G., A.J.D. and J.A.O. performed the experiments. Z.R.C. and R.D. evaluated the training data, provided dermatological insights and clinical context in all steps of the analyses, and analyzed images from concept retrieval experiments. C.K., S.U.G., A.J.D., J.A.O., Z.R.C., R.D. and S.-I.L. wrote the paper. S.-I.L. secured funding. R.D. and S.-I.L. co-supervised the study.

Corresponding authors

Correspondence to Roxana Daneshjou or Su-In Lee.

Ethics declarations

Competing interests

R.D. reports consulting fees from L’Oreal, Frazier Healthcare Partners, Pfizer, DWA and VisualDx; stock options from MDAcne and Revea for advisory board service; and research funding from UCB. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Titus Brinker and Ben Glocker for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Dermoscopic images with artifacts from the ISIC dataset as determined by high concept presence scores calculated using MONET.

We show the top 30 images for each artifact. a, Purple pen. b, Orange sticker. c, Nail. d, Hair. e, Dermoscopic border. Figures adapted from refs. 11 and 12 with permission and from refs. 13 and 14 under a CC-BY license.
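As a sketch of how such a ranking can be produced, the snippet below ranks a dataset by concept presence score and returns the top hits. It assumes image embeddings precomputed with a CLIP-style image encoder; `encode_text` and the variable names are hypothetical placeholders, not the released MONET API.

```python
# Hypothetical sketch of concept-based retrieval for data auditing: rank
# images by cosine similarity to a concept prompt and inspect the top k.
import numpy as np

def top_k_by_concept(image_embeddings: np.ndarray,
                     concept_embedding: np.ndarray,
                     k: int = 30) -> np.ndarray:
    """Indices of the k images most similar to the concept embedding."""
    # Normalize so that dot products equal cosine similarities.
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = concept_embedding / np.linalg.norm(concept_embedding)
    scores = imgs @ txt
    return np.argsort(scores)[::-1][:k]

# Usage (placeholders): flag likely "purple pen" artifacts in a dataset.
# embeddings = ...                       # (n_images, d) image embeddings
# idx = top_k_by_concept(embeddings, encode_text("purple pen marking"))
```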

Extended Data Fig. 2 Concept bottleneck models in the interinstitution model transfer setting.

a–d, We train models using images from Hospital Barcelona and test their performance using images from Med. U. Vienna. In each setting, we repeated the evaluations 20 times, using a different split of images from the training site into train and validation sets each time. a,b, Performance comparison of malignancy and melanoma prediction models. We measure the validation AUROC on a 25% validation set from Hospital Barcelona data. We measure the test AUROC on the entire Med. U. Vienna data. c,d, Coefficients of the linear model in MONET + CBM for malignancy and melanoma predictions. The error bars, obtained from n = 20 different runs, indicate the 95% confidence interval, extending from the mean. e–h, We train models using images from Med. U. Vienna and test their performance using images from Hospital Barcelona. e,f, Performance comparison of malignancy and melanoma prediction models. We measure the validation AUROC on a 25% validation set from Med. U. Vienna data. We measure the test AUROC on the entire Hospital Barcelona data. g,h, Coefficients of the linear model in MONET + CBM for malignancy and melanoma predictions. The error bars, obtained from n = 20 different runs, indicate the 95% confidence interval, extending from the mean.
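As a minimal sketch of the MONET + CBM recipe evaluated in these panels, assuming concept presence scores have already been computed: a linear (logistic regression) classifier is fit on the scores, so each coefficient attaches to a named concept. The scikit-learn calls are real; the data below are random placeholders standing in for MONET scores and malignancy labels.

```python
# Sketch of a concept bottleneck model: a linear classifier over concept
# presence scores, evaluated by AUROC, with concept-level coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
concept_scores = rng.random((500, 10))  # placeholder: 10 curated concepts
labels = rng.integers(0, 2, size=500)   # placeholder: malignant vs. benign

X_train, X_val, y_train, y_val = train_test_split(
    concept_scores, labels, test_size=0.25, random_state=0)

cbm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation AUROC:",
      roc_auc_score(y_val, cbm.predict_proba(X_val)[:, 1]))

# Each coefficient belongs to a named concept, giving the concept-level
# explanations plotted in panels c,d and g,h.
print("per-concept coefficients:", cbm.coef_.ravel())
```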

Extended Data Fig. 3 Effect of concept sets on MONET + CBM’s performance.

We compare CBMs operating on our curated, task-relevant concepts with those operating on SkinCon concepts. From each concept list, we sample subsets of concepts of varying sizes and train CBMs separately on these subsets, repeating this process 20 times for each subset size. For our 10 curated concepts, we use subset sizes ranging from 1 to 10; for SkinCon concepts, we use subset sizes in steps of 5, plus 1 and the full set of 48 (that is, 1, 5, 10, …, 40, 45, 48). The center line indicates the mean across n = 20 runs, with the shaded area covering the 95% confidence interval. a, Performance of MONET + CBM for malignancy and melanoma predictions on clinical images with respect to the number of concepts. b, Performance of MONET + CBM for malignancy and melanoma predictions on dermoscopic images with respect to the number of concepts. c, Performance of MONET + CBM for malignancy and melanoma predictions on dermoscopic images in an interinstitution model transfer setting with respect to the number of concepts. We train models on Hospital Barcelona and test them on Med. U. Vienna. d, Performance of MONET + CBM for malignancy and melanoma predictions on dermoscopic images in an interinstitution model transfer setting with respect to the number of concepts. We train models on Med. U. Vienna and test them on Hospital Barcelona.
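The ablation reduces to the loop sketched below (placeholder data again): for each subset size, sample 20 concept subsets, retrain the linear CBM on each, and summarize the validation AUROC by its mean and an approximate 95% confidence interval.

```python
# Sketch of the concept-subset ablation: resample concept subsets of each
# size, retrain the linear CBM, and summarize AUROC across 20 runs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def subset_auroc(X, y, concept_idx, seed):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X[:, concept_idx], y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

rng = np.random.default_rng(0)
X = rng.random((500, 48))            # placeholder: 48 SkinCon concept scores
y = rng.integers(0, 2, size=500)     # placeholder labels

for size in [1, 5, 10, 20, 48]:
    aurocs = [subset_auroc(X, y, rng.choice(48, size=size, replace=False), s)
              for s in range(20)]
    mean = np.mean(aurocs)
    ci = 1.96 * np.std(aurocs, ddof=1) / np.sqrt(len(aurocs))  # ~95% CI of mean
    print(f"{size:2d} concepts: AUROC {mean:.3f} +/- {ci:.3f}")
```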

Supplementary information

Supplementary Information

Supplementary Methods, Discussion, Figs. 1–10 and Tables 1–8.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, C., Gadgil, S.U., DeGrave, A.J. et al. Transparent medical image AI via an image–text foundation model grounded in medical literature. Nat Med 30, 1154–1165 (2024). https://doi.org/10.1038/s41591-024-02887-x

