Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians

Abstract

Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5–15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel clinical settings, we present results showing that CoDoC’s performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
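The deferral idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (that is linked under Code availability). The `Case` record, the function names, the fixed AI threshold `theta` and the coarse grid search over a single score band are all assumptions made for illustration. The sketch defers to the clinical workflow when the AI confidence score falls inside a learned band, uses the thresholded AI opinion otherwise, and picks the band that minimises the false-positive rate without exceeding the clinician's false-negative rate on tuning data.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Case:
    """One tuning example (hypothetical schema, for illustration only)."""
    ai_score: float  # predictive AI confidence score in [0, 1]
    clinician: int   # clinical workflow opinion: 1 = positive, 0 = negative
    label: int       # ground-truth disease label: 1 = disease, 0 = none


def combined_opinion(case: Case, lo: float, hi: float, theta: float) -> int:
    """Defer to the clinician if lo <= ai_score < hi, else threshold the AI score."""
    if lo <= case.ai_score < hi:
        return case.clinician
    return int(case.ai_score >= theta)


def error_rates(cases: Sequence[Case], preds: Sequence[int]) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) of preds against labels."""
    fp = sum(1 for c, p in zip(cases, preds) if p == 1 and c.label == 0)
    fn = sum(1 for c, p in zip(cases, preds) if p == 0 and c.label == 1)
    negatives = sum(1 for c in cases if c.label == 0)
    positives = sum(1 for c in cases if c.label == 1)
    return fp / max(negatives, 1), fn / max(positives, 1)


def fit_deferral_band(
    cases: Sequence[Case], theta: float = 0.5, steps: int = 10
) -> Tuple[float, float, float]:
    """Grid-search a band [lo, hi) on tuning data.

    Returns (fpr, lo, hi) for the band with the lowest false-positive rate
    whose false-negative rate does not exceed the clinician-only baseline.
    The fallback band [0, inf) defers every case to the clinician.
    """
    clin_fpr, clin_fnr = error_rates(cases, [c.clinician for c in cases])
    best = (clin_fpr, 0.0, float("inf"))
    edges = [i / steps for i in range(steps + 1)]
    for lo in edges:
        for hi in edges:
            if hi < lo:
                continue
            preds = [combined_opinion(c, lo, hi, theta) for c in cases]
            fpr, fnr = error_rates(cases, preds)
            if fnr <= clin_fnr and fpr < best[0]:
                best = (fpr, lo, hi)
    return best
```

In use, the band would be fitted on a tuning split and then applied unchanged to held-out cases via `combined_opinion`. A single band with a fixed threshold is a deliberate simplification: it only captures the intuition that the AI is trusted where it is confident and the clinician is consulted elsewhere.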

Fig. 1: CoDoC training and deployment architecture.
Fig. 2: Performance of CoDoC in breast cancer prediction compared to that of a standalone predictive AI system and clinical readers.
Fig. 3: Performance of CoDoC in breast cancer prediction on a US mammography dataset (US Mammography Dataset 2).
Fig. 4: Performance of CoDoC in TB prediction.

Data availability

The mammography datasets from Northwestern Medicine and St. Clair Hospital were used under licenses for the current study and are not publicly available. The tuberculosis datasets from the Stop TB Partnership and icddr,b were used under a license for the current study and are not publicly available. US Mammography Dataset 2 can be requested via email to k.j.geras@nyu.edu for research purposes, and access shall be granted within a week’s time. The GitHub repository that hosts the code also contains details on how to obtain access to the data required to reproduce results for the UK mammography dataset and the associated timeline. These datasets consist only of the data required to train CoDoC (a database of predictive AI confidence scores, clinician opinion and ground truth disease label for each case in the tuning/validation/test set). For the UK Mammography Dataset, the images and data used in this publication are derived from the OPTIMAM database (https://pubs.rsna.org/doi/abs/10.1148/ryai.2020200103?journalCode=ai), the creation of which was funded by Cancer Research UK. The full database including medical images used to train the predictive AI can be requested at https://medphys.royalsurrey.nhs.uk/omidb/getting-access/; this request will be reviewed by the OPTIMAM steering committee (https://medphys.royalsurrey.nhs.uk/omidb/the-steering-committee/).

Code availability

The code is available at https://github.com/deepmind/codoc.

Acknowledgements

We would like to acknowledge the multiple contributors to this international project: Stop TB Partnership hosted by UNOPS; Cancer Research UK; the OPTIMAM project team and staff at the Royal Surrey County Hospital, who developed the UK Mammography OPTIMAM imaging database; our collaborators at Northwestern Medicine and all members of the Etemadi research group for their continued support of this work; St. Clair Hospital; and B.A. Klepchick, J.M. Andrus, R.J. Schaeffer and J.T. Sullivan. We thank the National Cancer Institute (NCI) for access to NCI data collected by the National Lung Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by the NCI. We also thank L. Peng, D. Webster, U. Telang and D. Belgrave for their valuable feedback and support throughout the course of this project; D. Tran, N. de Freitas and K. Kavukcuoglu for critically reading the manuscript and providing feedback; R. Pilgrim, A. Kiani and J. Rizk for work on partnership formation and engagement; R. May and E. Sutherland Robson for assistance with project coordination; S. Baur and S. Prabhakara for mammography domain expertise; and M. Wilson for early engineering work. The work by S.M., M.B. and N.P. was done at Google DeepMind/Google Research.

Author information

Contributions

K.D., J. Winkens, S.G., N.P., R.S., Y.B., P.K., T.C. and A. Karthikesalingam contributed to study conception and design; J. Witowski, S.M., S.S., M.S., T.S., G.C. and A. Karthikesalingam contributed to data acquisition; K.D., J. Winkens, M.B., S.G., N.P., R.S., M.D., T.S. and T.C. contributed to data analysis; K.D., J. Winkens, M.B., S.G., R.S., C.K., S.M., Z.Z.Q., J.C., K.G., J. Witowski, P.K., T.C. and A. Karthikesalingam contributed to data interpretation; K.D., J. Winkens, S.G., M.B., N.P., R.S., S.A., L.C., M.D., J.F., A. Kiraly, T.K., S.M., B.M., V.N., S.S., M.S. and T.C. contributed to the creation of new software used in this study; K.D., J. Winkens, M.B., S.G., R.S., P.S., Z.A., C.K., A. Kiraly, Z.Z.Q., J.C., K.G., J. Witowski, P.K., T.C. and A. Karthikesalingam contributed to drafting and revising the manuscript; and K.D., J. Winkens, M.B., S.G., R.S., P.S., Z.A., P.K., T.C. and A. Karthikesalingam contributed to paper organization and team logistics.

Corresponding authors

Correspondence to Krishnamurthy (Dj) Dvijotham or Jim Winkens.

Ethics declarations

Competing interests

This study was funded by Google LLC and/or a subsidiary thereof (‘Google’). K.D., J. Winkens, S.G., R.S., P.S., Z.A., S.A., Y.B., L.C., M.D., J.F., C.K., A. Kiraly, T.K., B.M., V.N., S.S., M.S., T.S., G.C., P.K., T.C. and A. Karthikesalingam are employees of Google and own stock as part of the standard compensation package. S.M., M.B. and N.P. are previous employees of Google; N.P. is a current employee of Microsoft and S.M. is a current employee of OpenAI. Z.Z.Q. and J.C. are employees of the Stop TB Partnership and collaborated with Google to support this research effort. K.G. and J. Witowski are employees of the NYU Grossman School of Medicine and collaborated with Google to support this research effort.

Peer review

Peer review information

Nature Medicine thanks Pranav Rajpurkar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Joao Monteiro and Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Additional information on datasets and images from breast examinations.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dvijotham, K., Winkens, J., Barsbey, M. et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians. Nat Med 29, 1814–1820 (2023). https://doi.org/10.1038/s41591-023-02437-x

  • DOI: https://doi.org/10.1038/s41591-023-02437-x
