Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians

Dvijotham, Krishnamurthy (Dj); Winkens, Jim; Barsbey, Melih; Ghaisas, Sumedh; Stanforth, Robert; Pawlowski, Nick; Strachan, Patricia; Ahmed, Zahra; Azizi, Shekoofeh; Bachrach, Yoram; Culp, Laura; Daswani, Mayank; Freyberg, Jan; Kelly, Christopher; Kiraly, Atilla; Kohlberger, Timo; McKinney, Scott; Mustafa, Basil; Natarajan, Vivek; Geras, Krzysztof; Witowski, Jan; Qin, Zhi Zhen; Creswell, Jacob; Shetty, Shravya; Sieniek, Marcin; Spitz, Terry; Corrado, Greg; Kohli, Pushmeet; Cemgil, Taylan; Karthikesalingam, Alan

doi:10.1038/s41591-023-02437-x

Article
Published: 17 July 2023

Nature Medicine volume 29, pages 1814–1820 (2023)

7269 Accesses
13 Citations
132 Altmetric

Abstract

Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5–15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel futuristic clinical settings, we present results showing that CoDoC’s performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
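The idea of learning when to rely on the predictive AI and when to defer to the clinical workflow can be illustrated with a small sketch. This is not the published CoDoC algorithm (see the open-source implementation for that); it is a deliberately simplified rule that assumes each tuning case provides an AI confidence score in [0, 1], a binary clinician opinion and a binary ground-truth label. The function names, the number of bins and the 0.5 decision threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# One case: (AI confidence score in [0, 1], clinician opinion 0/1, true label 0/1).
# This tuple layout is an assumption made for this sketch.
Case = Tuple[float, int, int]


def bin_index(score: float, num_bins: int) -> int:
    """Map a confidence score in [0, 1] to one of num_bins equal-width bins."""
    return min(int(score * num_bins), num_bins - 1)


def fit_deferral(cases: Sequence[Case], num_bins: int = 4,
                 threshold: float = 0.5) -> List[bool]:
    """Return, per score bin, whether to defer to the clinician.

    A bin is deferred only if the clinician was strictly more accurate than
    the thresholded AI score on the tuning cases that fall in that bin.
    """
    ai_correct = [0] * num_bins
    clinician_correct = [0] * num_bins
    for score, clinician, label in cases:
        b = bin_index(score, num_bins)
        ai_correct[b] += int(int(score >= threshold) == label)
        clinician_correct[b] += int(clinician == label)
    return [clinician_correct[b] > ai_correct[b] for b in range(num_bins)]


def predict(score: float, clinician: int, defer: Sequence[bool],
            threshold: float = 0.5) -> int:
    """Use the clinician's opinion in deferred bins, the AI's otherwise."""
    if defer[bin_index(score, len(defer))]:
        return clinician
    return int(score >= threshold)


# Toy tuning set (fabricated for illustration only, not study data).
TUNING: List[Case] = [
    (0.05, 0, 0), (0.10, 1, 0),                # confident negatives: AI reliable
    (0.30, 0, 0), (0.40, 1, 1), (0.45, 1, 1),  # uncertain: clinician better
    (0.55, 0, 0), (0.60, 0, 0), (0.70, 1, 1),  # uncertain: clinician better
    (0.90, 1, 1), (0.95, 0, 1),                # confident positives: AI reliable
]
DEFER = fit_deferral(TUNING)
```

On this toy data the rule defers only in the two middle bins, where the AI score is least decisive, and keeps the AI's decision at the confident extremes. The fraction of cases falling in non-deferred bins is what would translate into reduced clinician workload.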

**Fig. 1: CoDoC training and deployment architecture.**

**Fig. 2: Performance of CoDoC in breast cancer prediction compared to that of a standalone predictive AI system and clinical readers.**

**Fig. 3: Performance of CoDoC in breast cancer prediction on a US mammography dataset (US Mammography Dataset 2).**

**Fig. 4: Performance of CoDoC in TB prediction.**

Data availability

The mammography datasets from Northwestern Medicine and St. Clair Hospital were used under licenses for the current study and are not publicly available. The tuberculosis datasets from the Stop TB Partnership and icddr,b were used under a license for the current study and are not publicly available. US Mammography Dataset 2 can be requested via email to k.j.geras@nyu.edu for research purposes, and access shall be granted within a week’s time. The GitHub repository that hosts the code also contains details on how to obtain access to the data required to reproduce results for the UK mammography dataset and the associated timeline. These datasets only consist of the data required to train CoDoC (a database of predictive AI confidence scores, clinician opinion and ground truth disease label for each case in the tuning/validation/test set). For the UK Mammography Dataset, the images and data used in this publication are derived from the OPTIMAM database (https://pubs.rsna.org/doi/abs/10.1148/ryai.2020200103?journalCode=ai), the creation of which was funded by Cancer Research UK. The full database including medical images used to train the predictive AI can be requested at https://medphys.royalsurrey.nhs.uk/omidb/getting-access/; this request will be reviewed by the OPTIMAM steering committee (https://medphys.royalsurrey.nhs.uk/omidb/the-steering-committee/).
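As a rough illustration of the per-case data described above (AI confidence score, clinician opinion, ground-truth label and data split), the sketch below defines a record type and computes the clinician-only sensitivity and specificity on one split. The field names, split names and sample values are hypothetical; the actual file layout is documented in the code repository and may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CaseRecord:
    """One case; field names are assumptions made for this sketch."""
    ai_score: float          # predictive AI confidence score
    clinician_opinion: int   # 1 = clinician flags disease, 0 = does not
    label: int               # ground-truth disease label (1 = disease)
    split: str               # "tune", "val" or "test" (hypothetical names)


def clinician_sensitivity_specificity(
    records: Iterable[CaseRecord], split: str
) -> Tuple[float, float]:
    """Sensitivity and specificity of clinician opinions on one split."""
    tp = fn = tn = fp = 0
    for r in records:
        if r.split != split:
            continue
        if r.label == 1:
            if r.clinician_opinion == 1:
                tp += 1
            else:
                fn += 1
        else:
            if r.clinician_opinion == 0:
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Fabricated records for illustration only (not study data).
RECORDS = [
    CaseRecord(0.91, 1, 1, "test"),
    CaseRecord(0.35, 0, 1, "test"),
    CaseRecord(0.12, 0, 0, "test"),
    CaseRecord(0.08, 0, 0, "test"),
    CaseRecord(0.66, 1, 0, "test"),
    CaseRecord(0.50, 1, 1, "tune"),
]
SENS, SPEC = clinician_sensitivity_specificity(RECORDS, "test")
```

The false-negative rate is 1 minus sensitivity and the false-positive rate is 1 minus specificity, so comparisons such as "fewer false positives at the same false-negative rate" amount to comparing specificity between systems at matched sensitivity.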

Code availability

The code is available at https://github.com/deepmind/codoc.

Acknowledgements

We would like to acknowledge the multiple contributors to this international project: Stop TB Partnership hosted by UNOPS; Cancer Research UK; the OPTIMAM project team and staff at the Royal Surrey County Hospital, who developed the UK Mammography OPTIMAM imaging database; our collaborators at Northwestern Medicine and all members of the Etemadi research group for their continued support of this work; St. Clair Hospital; and B.A. Klepchick, J.M. Andrus, R.J. Schaeffer and J.T. Sullivan. We thank the National Cancer Institute (NCI) for access to NCI data collected by the National Lung Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by the NCI. We also thank L. Peng, D. Webster, U. Telang and D. Belgrave for their valuable feedback and support throughout the course of this project; D. Tran, N. de Freitas and K. Kavukcuoglu for critically reading the manuscript and providing feedback; R. Pilgrim, A. Kiani and J. Rizk for work on partnership formation and engagement; R. May and E. Sutherland Robson for assistance with project coordination; S. Baur and S. Prabhakara for mammography domain expertise; and M. Wilson for early engineering work. The work by S.M., M.B. and N.P. was done at Google DeepMind/Google Research.

Author information

These authors contributed equally: Krishnamurthy (Dj) Dvijotham, Jim Winkens, Melih Barsbey, Sumedh Ghaisas, Taylan Cemgil, Alan Karthikesalingam.

Authors and Affiliations

Google DeepMind, Mountain View, CA, USA
Krishnamurthy (Dj) Dvijotham
Google Research, New York, NY, USA
Jim Winkens
Bogazici University, Istanbul, Turkey
Melih Barsbey
Google DeepMind, London, UK
Sumedh Ghaisas, Robert Stanforth, Zahra Ahmed, Yoram Bachrach, Pushmeet Kohli & Taylan Cemgil
Microsoft Research, Cambridge, UK
Nick Pawlowski
Google Research, London, UK
Patricia Strachan, Mayank Daswani, Jan Freyberg, Christopher Kelly, Terry Spitz & Alan Karthikesalingam
Google DeepMind, Toronto, Ontario, Canada
Shekoofeh Azizi & Laura Culp
Google Research, Palo Alto, CA, USA
Atilla Kiraly, Timo Kohlberger, Vivek Natarajan, Shravya Shetty, Marcin Sieniek & Greg Corrado
OpenAI, San Francisco, CA, USA
Scott McKinney
Google DeepMind, Zurich, Switzerland
Basil Mustafa
NYU Grossman School of Medicine, New York, NY, USA
Krzysztof Geras & Jan Witowski
Stop TB Partnership, Geneva, Switzerland
Zhi Zhen Qin & Jacob Creswell

Contributions

K.D., J. Winkens, S.G., N.P., R.S., Y.B., P.K., T.C. and A. Karthikesalingam contributed to study conception and design; J. Witowski, S.M., S.S., M.S., T.S., G.C. and A. Karthikesalingam contributed to data acquisition; K.D., J. Winkens, M.B., S.G., N.P., R.S., M.D., T.S. and T.C. contributed to data analysis; K.D., J. Winkens, M.B., S.G., R.S., C.K., S.M., Z.Z.Q., J.C., K.G., J. Witowski, P.K., T.C. and A. Karthikesalingam contributed to data interpretation; K.D., J. Winkens, S.G., M.B., N.P., R.S., S.A., L.C., M.D., J.F., A. Kiraly, T.K., S.M., B.M., V.N., S.S., M.S. and T.C. contributed to the creation of new software used in this study; K.D., J. Winkens, M.B., S.G., R.S., P.S., Z.A., C.K., A. Kiraly, Z.Z.Q., J.C., K.G., J. Witowski, P.K., T.C. and A. Karthikesalingam contributed to drafting and revising the manuscript; and K.D., J. Winkens, M.B., S.G., R.S., P.S., Z.A., P.K., T.C. and A. Karthikesalingam contributed to paper organization and team logistics.

Corresponding authors

Correspondence to Krishnamurthy (Dj) Dvijotham or Jim Winkens.

Ethics declarations

Competing interests

This study was funded by Google LLC and/or a subsidiary thereof (‘Google’). K.D., J. Winkens, S.G., R.S., P.S., Z.A., S.A., Y.B., L.C., M.D., J.F., C.K., A. Kiraly, T.K., B.M., V.N., S.S., M.S., T.S., G.C., P.K., T.C. and A. Karthikesalingam are employees of Google and own stock as part of the standard compensation package. S.M., M.B. and N.P. are previous employees of Google, N.P. is a current employee of Microsoft and S.M. is a current employee of OpenAI. Z.Z.Q. and J.C. are employees of the Stop TB Partnership and collaborated with Google to support this research effort. K.G. and J. Witowski are employees of the NYU Grossman School of Medicine. K.G. and J. Witowski collaborated with Google to support this research effort.

Peer review

Peer review information

Nature Medicine thanks Pranav Rajpurkar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Joao Monteiro and Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Additional information on datasets and images from breast examinations.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dvijotham, K., Winkens, J., Barsbey, M. et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians. Nat Med 29, 1814–1820 (2023). https://doi.org/10.1038/s41591-023-02437-x

Received: 02 November 2022
Accepted: 05 June 2023
Published: 17 July 2023
Issue Date: July 2023
DOI: https://doi.org/10.1038/s41591-023-02437-x

This article is cited by

Clinical data mining: challenges, opportunities, and recommendations for translational applications
- Huimin Qiao
- Yijing Chen
- You Guo
Journal of Translational Medicine (2024)
Deep learning-aided decision support for diagnosis of skin disease across skin tones
- Matthew Groh
- Omar Badri
- Rosalind Picard
Nature Medicine (2024)
Heterogeneity and predictors of the effects of AI assistance on radiologists
- Feiyang Yu
- Alex Moehring
- Pranav Rajpurkar
Nature Medicine (2024)
“Metabolic fingerprints” of cachexia in lung cancer patients
- Armin Frille
- Jann Arends
- Thomas Beyer
European Journal of Nuclear Medicine and Molecular Imaging (2024)
Balancing human and AI roles in clinical imaging
- Fiona Gilbert
Nature Medicine (2023)
