
Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians

Abstract

The inferences of most machine-learning models powering medical artificial intelligence are difficult to interpret. Here we report a general framework for model auditing that combines insights from medical experts with a highly expressive form of explainable artificial intelligence. Specifically, we leveraged the expertise of dermatologists for the clinical task of differentiating melanomas from melanoma ‘lookalikes’ on the basis of dermoscopic and clinical images of the skin, and the power of generative models to render ‘counterfactual’ images to understand the ‘reasoning’ processes of five medical-image classifiers. By altering image attributes to produce analogous images that elicit a different prediction by the classifiers, and by asking physicians to identify medically meaningful features in the images, the counterfactual images revealed that the classifiers rely both on features used by human dermatologists, such as lesional pigmentation patterns, and on undesirable features, such as background skin texture and colour balance. The framework can be applied to any specialized medical domain to make the powerful inference processes of machine-learning models medically understandable.

Fig. 1: Overview of joint expert, XAI auditing procedure and audited AI classifiers.
Fig. 2: Joint expert and XAI auditing procedure reveals reasoning processes of dermatology AI classifiers.
Fig. 3: Experimental validation of findings from expert analysis of counterfactual images.
Fig. 4: Explanations of failure cases of dermatology AI classifiers, illustrating key findings from our systematic analysis.

Data availability

The images used in this study were obtained from publicly available repositories. ISIC images are available at https://challenge.isic-archive.com/data. Fitzpatrick17k images are available at https://github.com/mattgroh/fitzpatrick17k. The DDI images are available at https://stanfordaimi.azurewebsites.net/datasets/35866158-8196-48d8-87bf-50dca81df965. Model weights for the DeepDerm classifier are available at https://zenodo.org/record/6784279#.ZFrDc9LMK-Z. The weights and model specification for the ModelDerm classifier are available at https://figshare.com/articles/Caffemodel_files_and_Python_Examples/5406223. Model weights for our retrained variant of the SIIM-ISIC competition classifier are available at https://zenodo.org/doi/10.5281/zenodo.10049216. Scanoma and Smart Skin Cancer Detection are third-party software for which we cannot redistribute model weights. At the time of writing, both are apps that are available for download with no fee from the Google Play store and from third-party APK-package download sites.

Code availability

Custom code, including a PyTorch implementation of explanation by progressive exaggeration and classes for loading datasets and classifiers, is available at https://github.com/suinleelab/derm_audit. The weights for the trained generative models and the retrained SIIM-ISIC classifier are available at https://zenodo.org/doi/10.5281/zenodo.10049216.

References

  1. Wu, E. et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med. 27, 582–584 (2021).

  2. Reddy, S. Explainability and artificial intelligence in medicine. Lancet Digit. Health 4, E214–E215 (2022).

  3. Young, A. T. et al. Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models. npj Digit. Med. 4, 10 (2021).

  4. DeGrave, A. J., Janizek, J. D. & Lee, S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619 (2021).

  5. Singh, N. et al. Agreement between saliency maps and human-labeled regions of interest: applications to skin disease classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3172–3181 (IEEE, 2020).

  6. Bissoto, A., Fornaciali, M., Valle, E. & Avila, S. (De) constructing bias on skin lesion datasets. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2766–2774 (IEEE, 2019).

  7. Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135–1141 (2019).

  8. Singla, S., Pollack, B., Chen, J. & Batmanghelich, K. Explanation by progressive exaggeration. In International Conference on Learning Representations (ICLR, 2020).

  9. Mertes, S., Huber, T., Weitz, K., Heimerl, A. & André, E. GANterfactual—counterfactual explanations for medical non-experts using generative adversarial learning. Front. Artif. Intell. 5, 825565 (2022).

  10. Ghoshal, B. & Tucker, A. Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection. Preprint at arXiv:2003.10769 (2020).

  11. Ozturk, T. et al. Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 121, 103792 (2020).

  12. Brunese, L., Mercaldo, F., Reginelli, A. & Santone, A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput. Methods Programs Biomed. 196, 105608 (2020).

  13. Karim, M. et al. DeepCOVIDExplainer: explainable COVID-19 diagnosis from chest X-ray images. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1034–1037 (IEEE, 2020).

  14. Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).

  15. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).

  16. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908 (2020).

  17. Han, S. S. et al. Augmented intelligence dermatology: deep neural networks empower medical professionals in diagnosing skin cancer and predicting treatment options for 134 skin disorders. J. Invest. Dermatol. 140, 1753–1761 (2020).

  18. Sun, M. D. et al. Accuracy of commercially available smartphone applications for the detection of melanoma. Br. J. Dermatol. 186, 744–746 (2022).

  19. Freeman, K. et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. Br. Med. J. 368, m127 (2020).

  20. Beltrami, E. J. et al. Artificial intelligence in the detection of skin cancer. J. Am. Acad. Dermatol. 87, 1336–1342 (2022).

  21. Daneshjou, R. et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).

  22. Han, S. S. et al. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J. Invest. Dermatol. 138, 1529–1538 (2018).

  23. Ha, Q., Liu, B. & Liu, F. Identifying melanoma images using EfficientNet ensemble: winning solution to the SIIM-ISIC melanoma classification challenge. Preprint at arXiv:2010.05351 (2020).

  24. Rotemberg, V. et al. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data 8, 34 (2021).

  25. Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 (2018).

  26. Combalia, M. et al. BCN20000: dermoscopic lesions in the wild. Preprint at arXiv:1908.02288 (2019).

  27. Groh, M. et al. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In Proceedings of the Computer Vision and Pattern Recognition (CVPR) Sixth ISIC Skin Image Analysis Workshop (IEEE, 2021).

  28. Karras, T. et al. Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8107–8116 (IEEE, 2020).

  29. Shi, K. et al. A retrospective cohort study of the diagnostic value of different subtypes of atypical pigment network on dermoscopy. J. Am. Acad. Dermatol. 83, 1028–1034 (2020).

  30. Yélamos, O. et al. Usefulness of dermoscopy to improve the clinical and histopathologic diagnosis of skin cancers. J. Am. Acad. Dermatol. 80, 365–377 (2019).

  31. Halpern, A. C., Marghoob, A. A. & Reiter, O. Melanoma Warning Signs: What You Need to Know About Early Signs of Skin Cancer (Skin Cancer Foundation, 2021); https://www.skincancer.org/skin-cancer-information/melanoma/melanoma-warningsigns-and-images/. Accessed April 2023.

  32. Massi, D., De Giorgi, V., Carli, P. & Santucci, M. Diagnostic significance of the blue hue in dermoscopy of melanocytic lesions: a dermoscopic-pathologic study. Am. J. Dermatopathol. 23, 463–469 (2001).

  33. Marghoob, N. G., Liopyris, K. & Jaimes, N. Dermoscopy: a review of the structures that facilitate melanoma detection. J. Osteopath. Med. 119, 380–390 (2019).

  34. Oliveria, S. A., Saraiya, M., Geller, A. C., Heneghan, M. K. & Jorgensen, C. Sun exposure and risk of melanoma. Arch. Dis. Child. 91, 131–138 (2006).

  35. Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) 2223–2232 (IEEE, 2017).

  36. International Commission on Illumination. ISO/CIE 11664-5:2016(E) Colorimetry—part 5: CIE 1976 L*u*v* colour space and u′, v′ uniform chromaticity scale diagram (2016).

  37. Deng, Z., Gijsenij, A. & Zhang, J. Source camera identification using auto-white balance approximation. In 2011 IEEE International Conference on Computer Vision 57–64 (IEEE, 2011).

  38. Rader, R. K. et al. The pink rim sign: location of pink as an indicator of melanoma in dermoscopic images. J. Skin Cancer 2014, 719740 (2014).

  39. Tschandl, P. et al. Human–computer collaboration for skin cancer recognition. Nat. Med. 26, 1229–1234 (2020).

  40. Tschandl, P. et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based international, diagnostic study. Lancet Oncol. 20, 938–947 (2019).

  41. Weber, P., Sinz, C., Rinner, C., Kittler, H. & Tschandl, P. Perilesional sun damage as a diagnostic clue for pigmented actinic keratosis and Bowen’s disease. J. Eur. Acad. Dermatol. Venereol. 35, 2022–2026 (2021).

  42. Fitzpatrick, J. E., High, W. A. & Kyle, W. L. Urgent Care Dermatology: Symptom-Based Diagnosis. 477–488 (Elsevier, 2018).

  43. Wu, E. et al. Toward Stronger FDA Approval Standards for AI Medical Devices (Stanford University Human-centered Artificial Intelligence, 2022).

  44. Bansal, G. et al. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (ACM, 2021).

  45. Fok, R. & Weld, D. S. In search of verifiability: explanations rarely enable complementary performance in AI-advised decision making. Preprint at arXiv:2305.07722v3 (2023).

  46. Roth, L. Looking at Shirley, the ultimate norm: colour balance, image technologies, and cognitive equity. Can. J. Commun. 34, 111–136 (2009).

  47. Lester, J. C., Clark, L., Linos, E. & Daneshjou, R. Clinical photography in skin of colour: tips and best practices. Br. J. Dermatol. 184, 1177–1179 (2021).

  48. Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2, 158–164 (2018).

  49. Yamashita, T. et al. Factors in color fundus photographs that can be used by humans to determine sex of individuals. Transl Vis. Sci. Technol. 9, 4 (2020).

  50. Codella, N. C. F. et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI), 168–172 (IEEE, 2018).

  51. Tan, M. et al. MnasNet: platform-aware neural architecture search for mobile. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2820–2828 (IEEE, 2019).

  52. Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2704–2713 (IEEE, 2018).

  53. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

  54. Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019) 6105–6114 (PMLR, 2019).

  55. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 7132–7141 (IEEE, 2018).

  56. Zhang, H. et al. ResNeSt: split-attention networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2735–2745 (IEEE, 2022).

  57. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2818–2826 (IEEE, 2016).

  58. Giotis, I. et al. MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images. Expert Syst. Appl. 42, 6578–6585 (2015).

Acknowledgements

A.J.D., J.D.J., and S.-I.L. were supported by the National Science Foundation (CAREER DBI-1552309 and DBI-1759487) and the National Institutes of Health (R35 GM 128638 and R01 AG061132). R.D. was supported by the National Institutes of Health (5T32 AR007422-38) and the Stanford Catalyst Program.

Author information

Contributions

A.J.D., J.D.J., R.D. and S.-I.L. conceived the initial study. A.J.D. prepared data and developed software for the reproduction of dermatology AI classifiers, for their counterfactual analysis and for confirmatory experiments. A.J.D. and J.D.J. developed software for the generation of saliency maps. Z.R.C. and R.D. analysed counterfactual images and examined saliency maps. A.J.D., Z.R.C., J.D.J., R.D. and S.-I.L. analysed data and designed additional experiments. Z.R.C. and R.D. provided dermatological insights and clinical context. A.J.D., Z.R.C., J.D.J., R.D., and S.-I.L. wrote the manuscript. S.-I.L. secured funding, and R.D. and S.-I.L. jointly supervised the study.

Corresponding authors

Correspondence to Roxana Daneshjou or Su-In Lee.

Ethics declarations

Competing interests

R.D. reports fees from L’Oreal, Frazier Healthcare Partners, Pfizer, DWA and VisualDx for consulting; stock options from MDAcne and Revea for advisory board; and research funding from UCB. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of insights from counterfactuals and saliency maps.

We calculated feature attributions using three popular techniques, Expected Gradients, Kernel SHAP, and GradCAM (see Supplementary Methods) and then produced our best-effort visualizations of the resulting saliency maps. We failed to gather insights from the saliency maps, except that the AI classifier may focus on the lesion (but perhaps not always, depending on the saliency technique). In contrast, the counterfactuals provided more granular and medically interpretable insights: for instance, based on the malignant counterfactuals we inferred that multiple colors of pigment (top + bottom), erythema (middle + bottom), darker pigmentation (all), and blue-white veil (bottom) tend to elicit more malignant predictions. In this figure, all saliency maps and counterfactuals were generated in reference to our AI classifier ‘SIIM-ISIC’. Figure adapted with permission from ref. 25, ViDIR Group, Department of Dermatology, Medical University of Vienna.

Extended Data Fig. 2 Attributes identified by the joint expert–XAI auditing procedure as key influences on the output of individual dermatology AI classifiers, when evaluated on the ISIC dataset.

In contrast to main text Fig. 2, attributes are ordered by the proportion of counterfactual pairs from the specified AI classifier in which experts noted that attribute differs, enabling examination of attributes relevant to a particular AI classifier but not necessarily to most AI classifiers (for example, prominence of skin grooves or dermatoglyphs, which influences Scanoma and ModelDerm).

Extended Data Fig. 3 Attributes identified by the joint expert–XAI auditing procedure as key influences on the output of individual dermatology AI classifiers, when evaluated on the Fitzpatrick17k dataset.

In contrast to main text Fig. 2, attributes are ordered by the proportion of counterfactual pairs from the specified AI classifier in which experts noted that attribute differs, enabling examination of attributes relevant to a particular AI classifier but not necessarily to other AI classifiers.

Extended Data Fig. 4 Analysis of inter-reader variability, displaying the two readers’ individual conclusions side-by-side for each attribute.

For each reader, we separately determine whether that attribute was 'predominant' in benign or malignant counterfactuals, that is, present to a greater extent in benign (malignant) counterfactuals in at least twice as many images as malignant (benign) counterfactuals. The size of each rectangle (the 'fraction of counterfactual pairs') is then determined as the proportion of counterfactual pairs with a difference noted in the predominant direction, for that reader alone. While readers typically do not attain quantitative agreement on the fraction of counterfactual pairs for a given attribute, the presence and direction of an attribute’s effect typically remains consistent. For conciseness, attribute names are shortened as described in Supplementary Table 1.
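The predominance rule in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `summarize_attribute`, the label strings, and the choice to divide by all counterfactual pairs reviewed by the reader are assumptions made here for clarity.

```python
from collections import Counter


def summarize_attribute(notes, ratio=2.0):
    """Summarize one reader's notes for one attribute.

    `notes` holds one entry per counterfactual pair: "benign" or "malignant"
    if the reader saw the attribute to a greater extent in that counterfactual,
    or None if no difference was noted. (Hypothetical encoding.)
    Returns (predominant_direction, fraction_of_pairs).
    """
    counts = Counter(n for n in notes if n is not None)
    benign, malignant = counts["benign"], counts["malignant"]

    # "Predominant" means at least `ratio` times as many pairs in one
    # direction as in the other (the caption uses a factor of two).
    if benign > 0 and benign >= ratio * malignant:
        direction, hits = "benign", benign
    elif malignant > 0 and malignant >= ratio * benign:
        direction, hits = "malignant", malignant
    else:
        return None, 0.0

    # Assumption: the fraction is taken over all pairs the reader reviewed.
    return direction, hits / len(notes)


if __name__ == "__main__":
    reader_notes = ["malignant"] * 6 + ["benign"] * 2 + [None] * 2
    print(summarize_attribute(reader_notes))
```

Running the function once per reader and per attribute gives the two side-by-side values that the figure compares.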

Extended Data Fig. 5 Effect of the programmatic modification of image brightness on the predictions of the AI classifier.

We separately applied three methods of image brightness modification (see Supplementary Methods), then calculated the mean change in AI classifier output relative to the original, unaltered images. For modifications in linear RGB or Jzazbz space, we modified brightness by applying a multiplicative factor B = 2^n; we display AI classifier responses as a function of n. For modifications in CIELUV space, we add a constant ΔL* to the perceptual lightness L*, where the maximum value of L* is 100. To facilitate visualization, the vertical axis is normalized to the maximum absolute change in AI classifier output observed for a given method; the normalization factors are displayed at bottom right. Images indicate the effect of each given brightness modification.
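As a rough illustration of the linear-RGB variant, the sketch below scales a single pixel by a power-of-two factor after undoing the sRGB transfer function. It is not taken from the authors' repository: it assumes sRGB-encoded channel values in [0, 1], the standard sRGB transfer function, and simple clipping of out-of-range values. The function names are hypothetical, and the exact procedure used in the study is described in its Supplementary Methods.

```python
def srgb_to_linear(c):
    """Decode one sRGB channel value in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(c):
    """Encode one linear-light channel value in [0, 1] as sRGB."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055


def adjust_brightness_linear_rgb(pixel, n):
    """Scale a pixel's linear RGB values by a factor of two to the power n.

    `pixel` is an (r, g, b) tuple of sRGB-encoded floats in [0, 1].
    Values pushed outside [0, 1] are clipped (an assumption of this sketch).
    """
    factor = 2.0 ** n
    out = []
    for channel in pixel:
        linear = srgb_to_linear(channel) * factor
        linear = min(max(linear, 0.0), 1.0)
        out.append(linear_to_srgb(linear))
    return tuple(out)


def shift_lightness(l_star, delta):
    """Add a constant to CIELUV lightness L*, clipped to the range [0, 100]."""
    return min(max(l_star + delta, 0.0), 100.0)


if __name__ == "__main__":
    grey = (0.5, 0.5, 0.5)
    for n in (-1, 0, 1):
        print(n, adjust_brightness_linear_rgb(grey, n))
```

Working in linear light means that n = 1 doubles the physical intensity of each channel, which is not the same as doubling the stored sRGB values.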

Extended Data Fig. 6 Effect of the programmatic modification of image chromaticity on the predictions of the AI classifier.

We separately applied three methods of image chromaticity modification (see Supplementary Methods), then calculated the mean change in AI classifier output relative to the original, unaltered images. Each method of chromaticity modification reflects the chromatic adaptation transform (white balancing method) provided by the corresponding color appearance model (CIE 1976 L* u* v*, CIE 1976 L* a* b*, or CAM16). To facilitate visualization, the vertical axis is normalized to the maximum absolute change in AI classifier output observed for a given method; the normalization factors are displayed at bottom right. Images indicate the effect of each given chromaticity modification. Color bars indicate the hue to which a neutral color (white) is shifted by the chromaticity modification; colorfulness in the color bar (but not example images) is exaggerated for ease of viewing.
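The response curves in these figures are described as the mean change in classifier output, scaled by the largest absolute change for each method. A minimal sketch of that summary follows. The function name, the dictionary layout, and the example numbers are hypothetical, and the sketch assumes that each modified output is aligned with the output for the same unaltered image.

```python
def normalized_response(original_outputs, modified_outputs_by_level):
    """Compute mean change in classifier output per modification level.

    `original_outputs` is a list of classifier outputs on unaltered images.
    `modified_outputs_by_level` maps a modification level to a list of
    outputs on the same images after modification (same order).
    Returns (normalized_mean_change_by_level, normalization_factor).
    """
    mean_change = {}
    for level, outputs in modified_outputs_by_level.items():
        diffs = [m - o for m, o in zip(outputs, original_outputs)]
        mean_change[level] = sum(diffs) / len(diffs)

    # Normalize to the maximum absolute mean change for this method.
    scale = max(abs(v) for v in mean_change.values())
    if scale == 0:
        return mean_change, 0.0
    return {level: v / scale for level, v in mean_change.items()}, scale


if __name__ == "__main__":
    original = [0.2, 0.4]  # illustrative values only
    modified = {-1: [0.1, 0.3], 0: [0.2, 0.4], 1: [0.4, 0.6]}
    curve, factor = normalized_response(original, modified)
    print(curve, factor)
```

Reporting the normalization factor next to each curve, as the captions describe, keeps the absolute size of the effect recoverable even though every curve is drawn on the same vertical scale.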

Supplementary Information

Supplementary methods, Tables 1–4, Figs. 1–9 and references.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

DeGrave, A.J., Cai, Z.R., Janizek, J.D. et al. Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians. Nat. Biomed. Eng. (2023). https://doi.org/10.1038/s41551-023-01160-9

  • DOI: https://doi.org/10.1038/s41551-023-01160-9
