Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians

DeGrave, Alex J.; Cai, Zhuo Ran; Janizek, Joseph D.; Daneshjou, Roxana; Lee, Su-In

doi:10.1038/s41551-023-01160-9

Article
Published: 28 December 2023

Nature Biomedical Engineering (2023)

Abstract

The inferences of most machine-learning models powering medical artificial intelligence are difficult to interpret. Here we report a general framework for model auditing that combines insights from medical experts with a highly expressive form of explainable artificial intelligence. Specifically, we leveraged the expertise of dermatologists for the clinical task of differentiating melanomas from melanoma ‘lookalikes’ on the basis of dermoscopic and clinical images of the skin, and the power of generative models to render ‘counterfactual’ images to understand the ‘reasoning’ processes of five medical-image classifiers. By altering image attributes to produce analogous images that elicit a different prediction by the classifiers, and by asking physicians to identify medically meaningful features in the images, the counterfactual images revealed that the classifiers rely both on features used by human dermatologists, such as lesional pigmentation patterns, and on undesirable features, such as background skin texture and colour balance. The framework can be applied to any specialized medical domain to make the powerful inference processes of machine-learning models medically understandable.

**Fig. 1: Overview of joint expert, XAI auditing procedure and audited AI classifiers.**

**Fig. 2: Joint expert and XAI auditing procedure reveals reasoning processes of dermatology AI classifiers.**

**Fig. 3: Experimental validation of findings from expert analysis of counterfactual images.**

**Fig. 4: Explanations of failure cases of dermatology AI classifiers, illustrating key findings from our systematic analysis.**

Data availability

The images used in this study were obtained from publicly available repositories. ISIC images are available at https://challenge.isic-archive.com/data. Fitzpatrick17k images are available at https://github.com/mattgroh/fitzpatrick17k. The DDI images are available at https://stanfordaimi.azurewebsites.net/datasets/35866158-8196-48d8-87bf-50dca81df965. Model weights for the DeepDerm classifier are available at https://zenodo.org/record/6784279#.ZFrDc9LMK-Z. The weights and model specification for the ModelDerm classifier are available at https://figshare.com/articles/Caffemodel_files_and_Python_Examples/5406223. Model weights for our retrained variant of the SIIM-ISIC competition classifier are available at https://zenodo.org/doi/10.5281/zenodo.10049216. Scanoma and Smart Skin Cancer Detection are third-party software for which we cannot redistribute model weights. At the time of writing, both are apps that are available for download with no fee from the Google Play store and from third-party APK-package download sites.

Code availability

Custom code, including a PyTorch implementation of explanation by progressive exaggeration and classes for loading datasets and classifiers, is available at https://github.com/suinleelab/derm_audit. The weights for the trained generative models and the retrained SIIM-ISIC classifier are available at https://zenodo.org/doi/10.5281/zenodo.10049216.

References

Wu, E. et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med. 27, 582–584 (2021).
Reddy, S. Explainability and artificial intelligence in medicine. Lancet Digit. Health 4, E214–E215 (2022).
Young, A. T. et al. Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models. npj Digit. Med. 4, 10 (2021).
DeGrave, A. J., Janizek, J. D. & Lee, S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619 (2021).
Singh, N. et al. Agreement between saliency maps and human-labeled regions of interest: applications to skin disease classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3172–3181 (IEEE, 2020).
Bissoto, A., Fornaciali, M., Valle, E. & Avila, S. (De) constructing bias on skin lesion datasets. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2766–2774 (IEEE, 2019).
Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135–1141 (2019).
Singla, S., Pollack, B., Chen, J. & Batmanghelich, K. Explanation by progressive exaggeration. In International Conference on Learning Representations (ICLR, 2020).
Mertes, S., Huber, T., Weitz, K., Heimerl, A. & André, E. GANterfactual—counterfactual explanations for medical non-experts using generative adversarial learning. Front. Artif. Intell. 5, 825565 (2022).
Ghoshal, B. & Tucker, A. Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection. Preprint at arXiv:2003.10769 (2020).
Ozturk, T. et al. Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 121, 103792 (2020).
Brunese, L., Mercaldo, F., Reginelli, A. & Santone, A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput. Methods Programs Biomed. 196, 105608 (2020).
Karim, M. et al. DeepCOVIDExplainer: explainable COVID-19 diagnosis from chest X-ray images. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1034–1037 (IEEE, 2020).
Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908 (2020).
Han, S. S. et al. Augmented intelligence dermatology: deep neural networks empower medical professionals in diagnosing skin cancer and predicting treatment options for 134 skin disorders. J. Invest. Dermatol. 140, 1753–1761 (2020).
Sun, M. D. et al. Accuracy of commercially available smartphone applications for the detection of melanoma. Br. J. Dermatol. 186, 744–746 (2022).
Freeman, K. et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. Br. Med. J. 368, m127 (2020).
Beltrami, E. J. et al. Artificial intelligence in the detection of skin cancer. J. Am. Acad. Dermatol. 87, 1336–1342 (2022).
Daneshjou, R. et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).
Han, S. S. et al. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J. Invest. Dermatol. 138, 1529–1538 (2018).
Ha, Q., Liu, B. & Liu, F. Identifying melanoma images using EfficientNet ensemble: winning solution to the SIIM-ISIC melanoma classification challenge. Preprint at arXiv:2010.05351 (2020).
Rotemberg, V. et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data 8, 34 (2021).
Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 (2018).
Combalia, M. et al. BCN20000: dermoscopic lesions in the wild. Preprint at arXiv:1908.02288 (2019).
Groh, M. et al. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In Proceedings of the Computer Vision and Pattern Recognition (CVPR) Sixth ISIC Skin Image Analysis Workshop (IEEE, 2021).
Karras, T. et al. Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8107–8116 (IEEE, 2020).
Shi, K. et al. A retrospective cohort study of the diagnostic value of different subtypes of atypical pigment network on dermoscopy. J. Am. Acad. Dermatol. 83, 1028–1034 (2020).
Yélamos, O. et al. Usefulness of dermoscopy to improve the clinical and histopathologic diagnosis of skin cancers. J. Am. Acad. Dermatol. 80, 365–377 (2019).
Halpern, A. C., Marghoob, A. A. & Reiter, O. Melanoma Warning Signs: What You Need to Know About Early Signs of Skin Cancer (Skin Cancer Foundation, 2021); https://www.skincancer.org/skin-cancer-information/melanoma/melanoma-warningsigns-and-images/. Accessed April 2023.
Massi, D., De Giorgi, V., Carli, P. & Santucci, M. Diagnostic significance of the blue hue in dermoscopy of melanocytic lesions: a dermoscopic-pathologic study. Am. J. Dermatopathol. 23, 463–469 (2001).
Marghoob, N. G., Liopyris, K. & Jaimes, N. Dermoscopy: a review of the structures that facilitate melanoma detection. J. Osteopath. Med. 119, 380–390 (2019).
Oliveria, S. A., Saraiya, M., Geller, A. C., Heneghan, M. K. & Jorgensen, C. Sun exposure and risk of melanoma. Arch. Dis. Child. 91, 131–138 (2006).
Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) 2223–2232 (IEEE, 2017).
International Commission on Illumination. ISO/CIE 11664-5:2016(E) Colorimetry—part 5: CIE 1976 L*u*v* colour space and u′, v′ uniform chromaticity scale diagram (2016).
Deng, Z., Gijsenij, A. & Zhang, J. Source camera identification using auto-white balance approximation. In 2011 IEEE International Conference on Computer Vision 57–64 (IEEE, 2011).
Rader, R. K. et al. The pink rim sign: location of pink as an indicator of melanoma in dermoscopic images. J. Skin Cancer 2014, 719740 (2014).
Tschandl, P. et al. Human–computer collaboration for skin cancer recognition. Nat. Med. 26, 1229–1234 (2020).
Tschandl, P. et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based international, diagnostic study. Lancet Oncol. 20, 938–947 (2019).
Weber, P., Sinz, C., Rinner, C., Kittler, H. & Tschandl, P. Perilesional sun damage as a diagnostic clue for pigmented actinic keratosis and Bowen’s disease. J. Eur. Acad. Dermatol. Venereol. 35, 2022–2026 (2021).
Fitzpatrick, J. E., High, W. A. & Kyle, W. L. Urgent Care Dermatology: Symptom-Based Diagnosis. 477–488 (Elsevier, 2018).
Wu, E. et al. Toward Stronger FDA Approval Standards for AI Medical Devices (Stanford University Human-Centered Artificial Intelligence, 2022).
Bansal, G. et al. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, 2021).
Fok, R. & Weld, D. S. In search of verifiability: explanations rarely enable complementary performance in AI-advised decision making. Preprint at arXiv:2305.07722v3 (2023).
Roth, L. Looking at Shirley, the ultimate norm: colour balance, image technologies, and cognitive equity. Can. J. Commun. 34, 111–136 (2009).
Lester, J. C., Clark, L., Linos, E. & Daneshjou, R. Clinical photography in skin of colour: tips and best practices. Br. J. Dermatol. 184, 1177–1179 (2021).
Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2, 158–164 (2018).
Yamashita, T. et al. Factors in color fundus photographs that can be used by humans to determine sex of individuals. Transl Vis. Sci. Technol. 9, 4 (2020).
Codella, N. C. F. et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI), 168–172 (IEEE, 2018).
Tan, M. et al. MnasNet: platform-aware neural architecture search for mobile. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2820–2828 (IEEE, 2019).
Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2704–2713 (IEEE, 2018).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019) 6105–6114 (PMLR, 2019).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 7132–7141 (IEEE, 2018).
Zhang, H. et al. ResNeSt: split-attention networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2735–2745 (IEEE, 2022).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2818–2826 (IEEE, 2016).
Giotis, I. et al. MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst. Appl. 42, 6578–6585 (2015).

Acknowledgements

A.J.D., J.D.J., and S.-I.L. were supported by the National Science Foundation (CAREER DBI-1552309 and DBI-1759487) and the National Institutes of Health (R35 GM 128638 and R01 AG061132). R.D. was supported by the National Institutes of Health (5T32 AR007422-38) and the Stanford Catalyst Program.

Author information

These authors contributed equally: Roxana Daneshjou, Su-In Lee.

Authors and Affiliations

Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
Alex J. DeGrave, Joseph D. Janizek & Su-In Lee
Medical Scientist Training Program, University of Washington, Seattle, WA, USA
Alex J. DeGrave & Joseph D. Janizek
Program for Clinical Research and Technology, Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
Zhuo Ran Cai
Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
Roxana Daneshjou
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
Roxana Daneshjou

Authors

Alex J. DeGrave
Zhuo Ran Cai
Joseph D. Janizek
Roxana Daneshjou
Su-In Lee

Contributions

A.J.D., J.D.J., R.D. and S.-I.L. conceived the initial study. A.J.D. prepared data and developed software for the reproduction of dermatology AI classifiers, for their counterfactual analysis and for confirmatory experiments. A.J.D. and J.D.J. developed software for the generation of saliency maps. Z.R.C. and R.D. analysed counterfactual images and examined saliency maps. A.J.D., Z.R.C., J.D.J., R.D. and S.-I.L. analysed data and designed additional experiments. Z.R.C. and R.D. provided dermatological insights and clinical context. A.J.D., Z.R.C., J.D.J., R.D., and S.-I.L. wrote the manuscript. S.-I.L. secured funding, and R.D. and S.-I.L. jointly supervised the study.

Corresponding authors

Correspondence to Roxana Daneshjou or Su-In Lee.

Ethics declarations

Competing interests

R.D. reports fees from L’Oreal, Frazier Healthcare Partners, Pfizer, DWA and VisualDx for consulting; stock options from MDAcne and Revea for advisory board; and research funding from UCB. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of insights from counterfactuals and saliency maps.

We calculated feature attributions using three popular techniques, Expected Gradients, Kernel SHAP, and GradCAM (see Supplementary Methods) and then produced our best-effort visualizations of the resulting saliency maps. We failed to gather insights from the saliency maps, except that the AI classifier may focus on the lesion (but perhaps not always, depending on the saliency technique). In contrast, the counterfactuals provided more granular and medically interpretable insights: for instance, based on the malignant counterfactuals we inferred that multiple colors of pigment (top + bottom), erythema (middle + bottom), darker pigmentation (all), and blue-white veil (bottom) tend to elicit more malignant predictions. In this figure, all saliency maps and counterfactuals were generated in reference to our AI classifier ‘SIIM-ISIC’. Figure adapted with permission from ref. ²⁵, ViDIR Group, Department of Dermatology, Medical University of Vienna.

Extended Data Fig. 2 Attributes identified by the joint expert–XAI auditing procedure as key influences on the output of individual dermatology AI classifiers, when evaluated on the ISIC dataset.

In contrast to main text Fig. 2, attributes are ordered by the proportion of counterfactual pairs from the specified AI classifier in which experts noted that attribute differs, enabling examination of attributes relevant to a particular AI classifier but not necessarily to most AI classifiers (for example, prominence of skin grooves or dermatoglyphs, which influences Scanoma and ModelDerm).

Extended Data Fig. 3 Attributes identified by the joint expert–XAI auditing procedure as key influences on the output of individual dermatology AI classifiers, when evaluated on the Fitzpatrick17k dataset.

In contrast to main text Fig. 2, attributes are ordered by the proportion of counterfactual pairs from the specified AI classifier in which experts noted that attribute differs, enabling examination of attributes relevant to a particular AI classifier but not necessarily to other AI classifiers.

Extended Data Fig. 4 Analysis of inter-reader variability, displaying the two readers’ individual conclusions side-by-side for each attribute.

For each reader, we separately determine whether that attribute was 'predominant' in benign or malignant counterfactuals, that is, present to a greater extent in benign (malignant) counterfactuals in at least twice as many images as malignant (benign) counterfactuals. The size of each rectangle (the 'fraction of counterfactual pairs') is then determined as the proportion of counterfactual pairs with a difference noted in the predominant direction, for that reader alone. While readers typically do not attain quantitative agreement on the fraction of counterfactual pairs for a given attribute, the presence and direction of an attribute’s effect typically remains consistent. For conciseness, attribute names are shortened as described in Supplementary Table 1.
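The predominance rule described in this caption can be written out as a short sketch. The function name `summarize_attribute` and the labels `"benign"` and `"malignant"` are hypothetical and are not taken from the authors' code; the treatment of attributes with no noted differences is an assumption.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def summarize_attribute(
    notes: Sequence[Optional[str]],
) -> Tuple[Optional[str], float]:
    """Summarize one reader's notes for one attribute.

    Each entry of `notes` describes one counterfactual pair: "benign" if the
    reader saw the attribute to a greater extent in the benign counterfactual,
    "malignant" if in the malignant counterfactual, or None if the reader
    noted no difference.
    """
    counts = Counter(n for n in notes if n is not None)
    benign, malignant = counts["benign"], counts["malignant"]

    # Predominant direction: noted in at least twice as many pairs as the
    # opposite direction. Assumption: an attribute with no noted differences
    # has no predominant direction.
    if benign > 0 and benign >= 2 * malignant:
        direction = "benign"
    elif malignant > 0 and malignant >= 2 * benign:
        direction = "malignant"
    else:
        return None, 0.0

    # Fraction of all pairs with a difference in the predominant direction.
    return direction, counts[direction] / len(notes)
```

For example, a reader who noted the attribute as stronger in the malignant counterfactual for 6 of 20 pairs and in the benign counterfactual for 2 of 20 pairs would yield a predominant direction of `"malignant"` and a fraction of 6/20.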

Extended Data Fig. 5 Effect of the programmatic modification of image brightness on the predictions of the AI classifier.

We separately applied three methods of image brightness modification (see Supplementary Methods), then calculated the mean change in AI classifier output relative to the original, unaltered images. For modifications in linear RGB or J_z a_z b_z space, we modified brightness by applying a multiplicative factor B = 2ⁿ; we display AI classifier responses as a function of n. For modifications in CIELUV space, we add a constant ΔL* to the perceptual lightness L*, where the maximum value of L* is 100. To facilitate visualization, the vertical axis is normalized to the maximum absolute change in AI classifier output observed for a given method; the normalization factors are displayed at bottom right. Images indicate the effect of each given brightness modification.
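As a rough illustration of two of the brightness modifications named in this caption, the sketch below scales linear RGB by B = 2ⁿ and shifts CIELUV lightness by a constant. It assumes sRGB-encoded input with channels in [0, 1] and clips out-of-range values; both are assumptions, all function names are hypothetical, and the authors' exact procedures are given in their Supplementary Methods.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # sRGB-encoded channels in [0, 1]


def srgb_to_linear(c: float) -> float:
    """Decode one sRGB-encoded channel to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(c: float) -> float:
    """Encode one linear-light channel back to sRGB."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055


def scale_brightness(image: List[Pixel], n: float) -> List[Pixel]:
    """Multiply linear RGB by B = 2**n, clipping to the displayable range."""
    factor = 2.0 ** n
    out = []
    for pixel in image:
        # Assumption: values above 1.0 in linear light are clipped.
        scaled = [min(1.0, srgb_to_linear(c) * factor) for c in pixel]
        out.append(tuple(linear_to_srgb(c) for c in scaled))
    return out


def shift_lightness(l_star: float, delta: float) -> float:
    """Add a constant to CIELUV lightness L*, kept within 0 to 100."""
    return max(0.0, min(100.0, l_star + delta))
```

Working in linear light means that n = 1 doubles the physical intensity of each channel rather than doubling the gamma-encoded pixel value, which is why the decode and encode steps surround the multiplication.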

Extended Data Fig. 6 Effect of the programmatic modification of image chromaticity on the predictions of the AI classifier.

We separately applied three methods of image chromaticity modification (see Supplementary Methods), then calculated the mean change in AI classifier output relative to the original, unaltered images. Each method of chromaticity modification reflects the chromatic adaptation transform (white balancing method) provided by the corresponding color appearance model (CIE 1976 L* u* v*, CIE 1976 L* a* b*, or CAM16). To facilitate visualization, the vertical axis is normalized to the maximum absolute change in AI classifier output observed for a given method; the normalization factors are displayed at bottom right. Images indicate the effect of each given chromaticity modification. Color bars indicate the hue to which a neutral color (white) is shifted by the chromaticity modification; colorfulness in the color bar (but not example images) is exaggerated for ease of viewing.
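The summary statistic and the axis normalization described in this caption and the previous one can be sketched as follows. The function names are hypothetical, and the handling of a flat response curve is an assumption rather than something stated in the source.

```python
from typing import List, Tuple


def mean_change(original: List[float], modified: List[float]) -> float:
    """Mean change in classifier output relative to the unaltered images."""
    return sum(m - o for o, m in zip(original, modified)) / len(original)


def normalize_curve(changes: List[float]) -> Tuple[List[float], float]:
    """Scale a response curve by its maximum absolute value.

    Returns the normalized curve together with the normalization factor,
    which is the quantity the caption says is displayed at bottom right.
    """
    factor = max(abs(c) for c in changes)
    if factor == 0.0:
        # Assumption: a flat curve is returned unchanged.
        return list(changes), 0.0
    return [c / factor for c in changes], factor
```

Calling `normalize_curve` once per modification method puts every curve on a common scale from -1 to 1, and the returned factor preserves the absolute size of the effect for that method.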

Supplementary Information

Supplementary methods, Tables 1–4, Figs. 1–9 and references.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

DeGrave, A.J., Cai, Z.R., Janizek, J.D. et al. Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians. Nat. Biomed. Eng. (2023). https://doi.org/10.1038/s41551-023-01160-9

Received: 30 October 2023
Accepted: 30 October 2023
Published: 28 December 2023
DOI: https://doi.org/10.1038/s41551-023-01160-9
